This paper presents a statistical analysis of an experiment focusing on morphological case in Japanese. The study explores whether there is a significant difference in the acceptability of a specific type of Japanese copular sentence when used in two distinct contexts. Preliminary informal data collected by the author suggested that context plays a role in determining sentence acceptability, though individual variability was also observed (Harada 2018). To address this, a formal experiment with a larger sample size was conducted to draw more reliable conclusions. It will be shown that the contextual effect on the sentence acceptability is statistically significant at the traditional α = 0.05 level.
The paper is structured as follows: Section 2 provides a brief explanation of the phenomenon under investigation. Section 3 outlines the experimental design. Section 4 presents the statistical analysis of the experiment’s results. Finally, Section 5 concludes the paper. Note that Sections 2-3 are adapted from Harada (2024) for readability. As a result, readers familiar with Harada (2024) may skip directly to Section 4.
The experiment explored a phenomenon in Japanese copular constructions where the predicate can sometimes take an accusative case marker and sometimes cannot. Sentence (1) illustrates this phenomenon:
In (1), the predicate onigiri-o mit-tu ‘three rice balls’ can optionally take the accusative case marker -o. The acceptability of the accusative case depends on the context in which the sentence is uttered. For example, sentence (1) sounds natural in the context provided in (2a) but not in the context given in (2b):
Ken is Ai’s father and always cooks lunch for her. It’s now 6am, and Ai has just come into the kitchen. Ken says (1) to Ai.
Ken and Ai have long been monitoring when different kinds of food in their showcase go bad. Ken always checks at 4pm which food items and how many of them have spoiled. It’s now 4pm, and Ai has just come to the showcase. Looking at the food, Ken says (1) to Ai.
This variability in the acceptability of the accusative case across contexts raises several descriptive and theoretical questions. One key question is: what kinds of contexts allow the accusative case? In response, Harada (2018) offers a descriptive generalization, summarized in (3):
I refer to wh-questions that meet the conditions in (3) as whAcc-questions.
I demonstrate that while the context in (2a) accommodates a whAcc-question, the context in (2b) does not. As a result, only the context in (2a) allows the predicate accusative case. First, consider the whAcc-question accommodated in (2a), shown in (4):
The question in (4), which contains an accusative case-marked wh-item, is contextually salient because Ken cooks lunch for Ai every morning, and (1) is uttered in the morning. Additionally, the copular sentence in (1) answers the question in (4). Thus, (4) is a whAcc-question accommodated in (2a).
In contrast, it is difficult to imagine a whAcc-question in context (2b). The most natural wh-question to accommodate in (2b) that is answered by (1) would be (5). However, this question does not contain an accusative case-marked wh-item, so it is not a whAcc-question:
To summarize, the availability of the predicate accusative case depends on the utterance context and appears to be governed by the conditions in (3). However, as mentioned in Section 1, there are some differences in acceptability among individuals. Therefore, the proposed experiment examines whether (3) is a correct generalization of when the predicate accusative case is available.
I recruited 94 native Japanese speakers for the experiment through a platform called CrowdWorks. The initial target sample size of 88 participants was determined by conducting a simulation to avoid underestimating the sample size and reduce the risk of Type II errors, as well as Type M and Type S errors. The simulation was conducted using a created dataframe and a linear mixed-effects model, which were based on acceptability judgment data from Linzen and Oseki (2018). The simulation suggested that with 88 participants, the experiment would have a power of 0.83 (95% confidence interval: 0.82, 0.84). For more details on the power calculation, see Harada (2024).
During the recruitment process, more participants than requested responded to the experiment invitation on CrowdWorks. While 88 participants officially submitted the task on the platform, complete responses/ratings from 94 participants were recorded in my database. After reviewing the data, I decided to include all 94 participants in the analysis for two reasons: (1) excluding additional participants who provided complete and valid responses would unnecessarily reduce the sample size, and (2) some participants from the original 88 had to be excluded due to completing the experiment too quickly or providing unusual ratings for fillers/anchors. Including the additional participants helped maintain a robust sample size for analysis.
Eight participants were excluded based on the following criteria: (1) the short time they spent completing the experiment, (2) low standard deviation of ratings, and/or (3) unusual ratings of fillers and anchor items. The detailed definition of unusual ratings and the exclusion process are discussed in Section 4.1. As a result of these exclusions, the current study has responses from 86 participants, based on which the effect of whAcc-question will be explored in this paper.
Following Linzen and Oseki (2018), I recruited participants who met two conditions: (i) they lived in Japan from birth until at least age 13, and (ii) their parents spoke Japanese to them at home. I recruited speakers of any dialect as it was not clear whether speakers of different Japanese dialects might show varying patterns in acceptability judgments for sentences in the Tokyo dialect used in this study. As a result, the experiment included participants who speak various dialects, such as Kanto dialects (37 participants), Kansai dialect (17 participants), and other regional dialects. Since the sample size of speakers of each dialect is small except for Kanto dialects, I decided to divide the participants into speakers of Eastern Japanese dialect (47 participants) and Western Japanese dialect (38 participants). The total is 85 instead of 86 because one participant reported speaking both Eastern and Western Japanese dialects, and it was unclear which dialect was more prominent. It should also be noted that when dividing the participants into speakers of Eastern and Western Japanese dialects, I assigned speakers of Gifu, Mikawa, and Nagoya dialects (5 speakers in total) to the group of Western dialect speakers. While such speakers are sometimes considered Eastern Japanese dialect speakers, I made this decision because fewer speakers of the Western dialect were recruited in the current study. It will be shown that there is a marginally significant difference between Eastern and Western dialect speakers in their ratings of main sentences at the traditional α = 0.05 level.
Similarly, I included both linguists and non-linguists in the experiment. While there is some evidence suggesting that these two groups may provide different judgments (e.g., Spencer 1973; Gordon and Hendrick 1997; Dabrowska 2010), it remains unclear whether linguists should avoid collecting data from fellow linguists (e.g., Schütze and Sprouse 2014). On one hand, linguists’ judgments may be influenced by their theoretical perspectives (e.g., Edelman and Christiansen 2003; Wasow and Arnold 2005; Gibson and Fedorenko 2013), but on the other hand, they might also be more attuned to subtle differences in acceptability that non-linguists might miss (e.g., Newmeyer 1983; Grewendorf 2007). I asked participants in the questionnaire whether they had studied linguistics. It turns out that of the 86 participants, only 4 reported having studied linguistics. Among these, 3 showed a large difference in ratings between condition 0 and condition 1 sentences. However, the small sample size of participants with linguistics training limits the generalizability of this observation.
Lastly, the experiment also asked participants about the foreign languages they had studied. Some participants reported studying languages, including Chinese, English, French, German, Korean, and/or Spanish. Among these participants, 21 seemed to be able to manage basic conversation in English, and one participant each for Chinese and Spanish, based on the following certificates/experiences (no participants who reported studying French, German, or Korean provided information about their proficiency in those languages):
It will be shown that foreign language experience is a marginally significant effect on the rating of main sentences at the traditional α = 0.05 level. For simplicity, this paper refers to participants who can manage basic conversation in Chinese, English, or Spanish as bilingual speakers/participants, in contrast to other participants, who are referred to as monolingual speakers/participants.
In Section 4.2, we will briefly touch on how participants’ native dialects, linguistics background, and foreign language experiences affect their ratings of main items in the experiment.
The experiment involved 10 main items, each consisting of the same sequence of Japanese copular sentences uttered in two different contexts, resulting in 20 main sentences. For example, one item included sentence (1) in context (2a) and sentence (1) in context (2b). To minimize the effect of particular lexical items, we used 10 main items, consistent with Schütze and Sprouse’s (2014) recommendation to include at least 8 items to reduce lexical effects. While more items could increase the experiment’s power, 10 items were used to prevent participant fatigue or boredom.
The 10 main items were divided into two lists—List 1 and List 2—using a Latin square design. For example, if List 1 included item 1-condition 0 (main items predicted to sound natural), List 2 included item 1-condition 1(main items predicted to sound unnatural). Each participant was exposed to one version of either List 1 or List 2, ensuring that each participant encountered only one condition per item, resulting in exposure to 10 main sentences.
To ensure pseudorandomization, the main items were ordered such that at least one filler item separated any two main items, and acc.ok (main items predicted to sound natural) alternated with acc.bad (main items predicted to sound unnatural), with filler items interspersed. Twice as many filler sentences as main sentences were included, following @Cowart1997experimental, for three reasons: (i) to encourage use of the full range of the 7-point Likert scale (1 = least natural, 7 = most natural), (ii) to include diverse constructions to prevent influence from salient features of the main items, and (iii) to obscure the experiment’s intent. In addition to main and filler items, three anchor sentences were included at the beginning of each item set in the same order across all participants. These anchor sentences were designed to encourage use of the full scale and prevent biases such as skew or compression.
Each list consisted of four versions of item sets (i.e., main items, fillers, and anchors), created by counterbalancing the order of items to mitigate potential effects such as fatigue or response biases. These four orders were: original, reversed, split, and split-reversed. Thus, the final design resulted in eight versions of item sets, divided into List 1 and List 2, with four versions per list. Each participant was exposed to one version, viewing 33 sentences in total: 3 anchor sentences, 10 main sentences, and 20 filler sentences.
The table below illustrates the structure of these item sets, including the placement of main items (acc.ok and acc.bad), filler items (F), and anchor sentences (A):
| List1.original |
|---|
| A1 |
| A2 |
| A3 |
| acc.ok.03 |
| F03 |
| F16 |
| F09 |
| acc.bad.04 |
| F01 |
| F17 |
| acc.ok.07 |
| F15 |
| acc.bad.08 |
| F19 |
| acc.ok.09 |
| F08 |
| F14 |
| acc.bad.06 |
| F18 |
| F12 |
| F11 |
| F05 |
| F20 |
| F04 |
| acc.ok.05 |
| F07 |
| F02 |
| F06 |
| acc.bad.02 |
| F10 |
| acc.ok.01 |
| F13 |
| acc.bad.10 |
| List1.reversed |
|---|
| A1 |
| A2 |
| A3 |
| acc.bad.10 |
| F13 |
| acc.ok.01 |
| F10 |
| acc.bad.02 |
| F06 |
| F02 |
| F07 |
| acc.ok.05 |
| F04 |
| F20 |
| F05 |
| F11 |
| F12 |
| F18 |
| acc.bad.06 |
| F14 |
| F08 |
| acc.ok.09 |
| F19 |
| acc.bad.08 |
| F15 |
| acc.ok.07 |
| F17 |
| F01 |
| acc.bad.04 |
| F09 |
| F16 |
| F03 |
| acc.ok.03 |
| List1.split |
|---|
| A1 |
| A2 |
| A3 |
| F18 |
| F12 |
| F11 |
| F05 |
| F20 |
| F04 |
| acc.ok.05 |
| F07 |
| F02 |
| F06 |
| acc.bad.02 |
| F10 |
| acc.ok.01 |
| F13 |
| acc.bad.10 |
| F03 |
| F16 |
| acc.ok.03 |
| F09 |
| acc.bad.04 |
| F01 |
| F17 |
| acc.ok.07 |
| F15 |
| acc.bad.08 |
| F19 |
| acc.ok.09 |
| F08 |
| F14 |
| acc.bad.06 |
| List1.split-reversed |
|---|
| A1 |
| A2 |
| A3 |
| acc.bad.06 |
| F14 |
| F08 |
| acc.ok.09 |
| F19 |
| acc.bad.08 |
| F15 |
| acc.ok.07 |
| F17 |
| F01 |
| acc.bad.04 |
| F09 |
| acc.ok.03 |
| F16 |
| F03 |
| acc.bad.10 |
| F13 |
| acc.ok.01 |
| F10 |
| acc.bad.02 |
| F06 |
| F02 |
| F07 |
| acc.ok.05 |
| F04 |
| F20 |
| F05 |
| F11 |
| F12 |
| F18 |
| List2.original |
|---|
| A1 |
| A2 |
| A3 |
| acc.bad.03 |
| F07 |
| acc.ok.04 |
| F08 |
| F17 |
| F19 |
| F15 |
| F11 |
| F20 |
| acc.bad.07 |
| F06 |
| acc.ok.10 |
| F13 |
| acc.bad.05 |
| F01 |
| F10 |
| F16 |
| F05 |
| acc.ok.06 |
| F04 |
| F18 |
| acc.bad.01 |
| F02 |
| acc.ok.08 |
| F09 |
| F03 |
| acc.bad.09 |
| F12 |
| F14 |
| acc.ok.02 |
| List2.reversed |
|---|
| A1 |
| A2 |
| A3 |
| acc.ok.02 |
| F14 |
| F12 |
| acc.bad.09 |
| F03 |
| F09 |
| acc.ok.08 |
| F02 |
| acc.bad.01 |
| F18 |
| F04 |
| acc.ok.06 |
| F05 |
| F16 |
| F10 |
| F01 |
| acc.bad.05 |
| F13 |
| acc.ok.10 |
| F06 |
| acc.bad.07 |
| F20 |
| F11 |
| F15 |
| F19 |
| F17 |
| F08 |
| acc.ok.04 |
| F07 |
| acc.bad.03 |
| List2.split |
|---|
| A1 |
| A2 |
| A3 |
| F10 |
| F16 |
| F05 |
| acc.ok.06 |
| F04 |
| F18 |
| acc.bad.01 |
| F02 |
| acc.ok.08 |
| F09 |
| F03 |
| acc.bad.09 |
| F12 |
| acc.ok.02 |
| F14 |
| acc.bad.03 |
| F07 |
| acc.ok.04 |
| F08 |
| F17 |
| F19 |
| F15 |
| F11 |
| F20 |
| acc.bad.07 |
| F06 |
| acc.ok.10 |
| F13 |
| acc.bad.05 |
| F01 |
| List2.split-reversed |
|---|
| A1 |
| A2 |
| A3 |
| F01 |
| acc.bad.05 |
| F13 |
| acc.ok.10 |
| F06 |
| acc.bad.07 |
| F20 |
| F11 |
| F15 |
| F19 |
| F17 |
| F08 |
| acc.ok.04 |
| F07 |
| acc.bad.03 |
| F14 |
| acc.ok.02 |
| F12 |
| acc.bad.09 |
| F03 |
| F09 |
| acc.ok.08 |
| F02 |
| acc.bad.01 |
| F18 |
| F04 |
| acc.ok.06 |
| F05 |
| F16 |
| F10 |
This table demonstrates how pseudorandomization and counterbalancing were implemented to ensure that each participant received a well-balanced set of items. For additional details about the creation of the item sets and their randomized ordering, see Harada (2024).
Since this experiment involved 86 participants and each participant was exposed to 10 main sentences, acceptability ratings for 860 main sentences should have been collected. However, it turned out that two participants rated certain filler items twice and missed rating a specific main item.1 As a result, acceptability ratings for 858 main sentences were collected.
The experiment took place online. Participants first answered a questionnaire about their language background and familiarity with linguistics. Then, they were introduced to the concept of sentence naturalness as used in this experiment. The instructions emphasized that participants should disregard prescriptive grammar rules, the likelihood of the sentence being spoken in real life, and the plausibility of the sentence content. Following Schütze and Sprouse (2014), the experiment described these points by providing the Japanese counterparts of the following sentences:
After the explanation, participants were presented with three anchor sentences: one that is clearly natural (acceptability = 6~7), one that is clearly unnatural (acceptability = 1~2), and one with controversial acceptability (acceptability = 3~5). Following this, participants proceeded to rate the main and filler items.
The experiment took around 30 minutes on average, including instructions and brief questionnaires.
This section analyzes the results of the experiment. First, after setting up the working directory where the CSV file containing the experiment results is located, we read the CSV file using tidyverse.
# Load a package to use read_csv().
library(tidyverse)
# Read a CSV file.
experimentResult_original <- read_csv('experimentResult.csv')
Using dplyer, we first check the average ratings of three anchor sentences.
# Load packages for filter(), group_by(), summarise(), etc.
library(dplyr)
# Calculate the average ratings of anchor sentences.
averageRatings_practice <- experimentResult_original %>%
filter(experiment_id == "Practice") %>%
group_by(item_id) %>%
summarise(average_rating = mean(rating, na.rm = TRUE), .groups = "drop")
# Print the result of the calculation.
print(averageRatings_practice, n = Inf)
## # A tibble: 3 × 2
## item_id average_rating
## <dbl> <dbl>
## 1 1 6.09
## 2 2 2.42
## 3 3 4.09
Anchor sentences 1, 2, and 3 are designed to be rated 6-7, 1-2, and 3-5, respectively, to help participants understand the correspondence between the 7-point Likert scale values and the naturalness of sentences. As you can see above, each of the anchor sentences was rated appropriately.
Next, we turn to examining the average ratings of condition 0 and condition 1 for the main items.
# Calculate the average ratings of condition 0 and 1 of main items.
averageRatings_averageAccCase <- experimentResult_original %>%
filter(experiment_id == "AccCase") %>%
group_by(condition) %>%
summarise(average_rating = mean(rating, na.rm = TRUE), .groups = "drop")
print(averageRatings_averageAccCase, n = Inf)
## # A tibble: 2 × 2
## condition average_rating
## <dbl> <dbl>
## 1 0 5.66
## 2 1 2.27
The condition 0 sentences are rated higher than the condition 1 sentences on average by 3.39. The sentences in the two conditions appear to differ significantly, given that the difference between the clearly natural anchor sentence 1 (6.09) and the clearly unnatural anchor sentence 2 (2.42) is 3.67.
To further explore the difference between sentences in condition 0 and condition 1, we examine the difference between the two conditions for each main item.
# Calculate the average ratings and differences between condition 0 and condition 1 for each item.
itemBy_averageRating_accCase <- experimentResult_original %>%
filter(experiment_id == "AccCase") %>%
group_by(item_id) %>%
summarise(
avgRating_condition0 = mean(rating[condition == 0], na.rm = TRUE),
avgRating_condition1 = mean(rating[condition == 1], na.rm = TRUE),
diff_condition_0_1 = avgRating_condition0 - avgRating_condition1
)
# Custom labeller to add "item" before the item number
custom_labeller <- function(labels) {
labels$item_id <- paste0("item ", labels$item_id)
labels
}
# Visualize raw data points and linear trend lines by condition for each item.
ggplot(experimentResult_original %>% filter(experiment_id == "AccCase"), aes(x = factor(condition), y = rating)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
facet_wrap(~ item_id, labeller = custom_labeller) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) + # Restrict x-axis to "0" and "1"
labs(
x = "Condition",
y = "Rating",
color = "Condition",
title = "Trend Lines and Raw Data by Condition for Each Item"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.1: Trend lines and individual data points showing rating distributions across experimental conditions for all test items. Pink dots (condition 0) consistently cluster at higher ratings (around 5-6), while teal dots (condition 1) cluster at lower ratings (around 1-2). The blue trend lines demonstrate a consistent downward pattern across all ten items, indicating a robust effect of the experimental manipulation that is independent of lexical variation.
The figure shows the trend lines and raw data points for ratings of each item under the two experimental conditions, labeled as 0 and 1. For each item, the ratings in condition 0 appear to be generally higher than in condition 1, with the trend line indicating a downward slope from condition 0 to condition 1. This suggests that the experimental manipulation had a consistent effect on participant ratings and that the overall average ratings of condition 0 and 1 sentences (5.66 and 2.27, respectively) are due to the difference in condition, not caused by specific lexical items used in the sentences.
Lastly, we examine the difference between the two conditions for each participant.
participantBy_averageRating_accCase <- experimentResult_original %>%
filter(experiment_id == "AccCase") %>%
group_by(participant_id) %>%
summarise(
avgRating_condition0 = mean(rating[condition == 0], na.rm = TRUE),
avgRating_condition1 = mean(rating[condition == 1], na.rm = TRUE),
diff_condition_0_1 = avgRating_condition0 - avgRating_condition1
) %>%
arrange(desc(diff_condition_0_1))
# Install a package to use mixedsort().
library(gtools)
library(forcats)
ggplot(
experimentResult_original %>%
filter(experiment_id == "AccCase") %>%
mutate(participant_id = fct_relevel(participant_id, mixedsort(unique(participant_id)))),
aes(x = factor(condition), y = rating)
) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
facet_wrap(~ participant_id, ncol = 8) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) + # Restrict x-axis to "0" and "1"
#scale_y_continuous(limits = c(0, 7), breaks = seq(0, 7, 1)) + # Set y-axis from 0 to 7 with breaks
scale_y_continuous(limits = c(-0.1, 7.1)) +
labs(
x = "Condition",
y = "Rating",
color = "Condition",
title = "Trend Lines and Raw Data by Condition for Each Participant"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5) # Ensure proper alignment for "0" and "1"
)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 90 rows containing missing values or values outside the scale range
## (`geom_point()`).
Figure 4.2: Trend lines and raw daata across the two experimental conditions for each participant. The majority of participants exhibit a downward trend in the measured variable from condition 0 to condition 1.
The figure presents trend lines and raw data for each participant (P1 to P102) across the two experimental conditions (0 and 1). Overall, the trend lines for most participants indicate a downward slope from condition 0 to condition 1, suggesting a decrease in the measured variable. However, there are a few exceptions. The trend line for participant P26 shows an upward slope, indicating an increase in the measured variable from condition 0 to condition 1. This participant’s response pattern appears to deviate from the majority. While participants whose ratings deviate from the overall trend are not necessarily outliers, the case of P26 is worth examining further to understand the reasons behind the different response pattern.
Additionally, participant P45 seems to have provided ratings for only a few data points. This suggests that P45 may not have completed the full experiment. In fact, there are likely more participants like P45, who did not complete the full experiment and thus are not included in the figure above. We will examine these participants as well in the next subsubsection.
In the rest of this section, we will examine the difference between condition 0 and condition 1 sentences in more detail. To do so, Section 4.1 outlines the steps taken to preprocess the experimental data, including the exclusion of unusual participants and other refinements, to ensure the dataset is suitable for subsequent statistical analysis. Section 4.2 then conducts the statistical analysis of the main items.
First, some participants did not complete the experiment, as shown below.
# Identify participants who rated 32 or fewer sentences (total number of sentences was 33 sentences), and the number of ratings they provided.
participants_incomplete <- experimentResult_original %>%
group_by(participant_id) %>%
summarise(ratings_count = n()) %>%
filter(ratings_count < 33)
print(participants_incomplete)
## # A tibble: 8 × 2
## participant_id ratings_count
## <chr> <int>
## 1 P30 4
## 2 P31 3
## 3 P32 4
## 4 P37 6
## 5 P39 1
## 6 P45 5
## 7 P63 3
## 8 P98 32
The eight participants listed above did not complete the experiment, so they should be excluded from the dataframe and not considered in the subsequent analysis. Note that for this reason, most of these participants are not included in Figure 4.2.
# Filter out participants with fewer than 33 ratings.
experimentResult <- experimentResult_original %>%
group_by(participant_id) %>%
filter(n() == 33) %>%
ungroup()
Next, we examine how long participants spent on the experiment. Since the experiment investigates the contextual effect on the availability of the accusative case, it is essential for participants to carefully read both the contexts and the sentences whose acceptability is rated under those contexts. Therefore, we need to exclude participants who completed the experiment unusually quickly compared to most other participants. First, consider the following violin plot to observe the overall pattern of time spent by participants.
# Load the package for ymd_hms().
library(lubridate)
# Calculates the total time each participant spent on the experiment.
timeSpent <- experimentResult %>%
group_by(participant_id) %>%
summarise(
start_time = min(ymd_hms(timestamp)),
end_time = max(ymd_hms(timestamp)),
time_spent = difftime(max(ymd_hms(timestamp)), min(ymd_hms(timestamp)), units = "mins"),
.groups = "drop"
) %>%
arrange(time_spent)
ggplot(timeSpent, aes(x = "", y = time_spent)) +
geom_violin(fill = "skyblue", color = "black", alpha = 0.7) +
theme_minimal() +
labs(x = "", y = "Time Spent (minutes)")
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
Figure 4.3: Distribution of total time spent by participants in completing the experiment.
The violin plot shows the time participants spent from the end of the first item to the end of the last item—i.e., the time participants spent answering 32 items instead of the total 33 items in the experiment. According to the violin plot, most participants spent around 12 minutes on the experiment. The top part of the plot, which represents participants who spent a long time on the experiment, stands out, but here we are concerned with participants who spent a short amount of time. Such participants may not have read the contexts or texts carefully, whereas spending a long time on the experiment is not necessarily a sign of insincere participation.
To examine individual differences in the time spent on the experiment in more detail, see the following scatter plot.
ggplot(timeSpent, aes(x = participant_id, y = time_spent)) +
geom_boxplot() +
theme_minimal() +
labs(x = "", y = "Time Spent (minutes)") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
Figure 4.4: Distribution of total experiment completion times, with each data point representing an individual participant.
At first glance, it seems that the three participants who spent around three minutes are unusual. To confirm this hypothesis, we compare these participants with others in the first quantile in terms of time spent on the experiment (i.e., the 25% of participants who spent the least time on the experiment). To do this, we first calculate the first quantile (Q1) of the “Time Spent” in Figure 4.4.
# Calculate the first quantile (Q1) of the time_spent column
quantile1 <- quantile(timeSpent$time_spent, probs = 0.25)
# Convert Q1 to a numeric value
quantile1_numeric <- as.numeric(quantile1)
print(quantile1_numeric)
## [1] 9.305765
Based on the Q1 value, I created the plot in 4.5, which displays cumulative response times. The x-axis shows time thresholds in seconds (up to about 60 seconds), and the y-axis shows the number of questions answered within each time threshold (up to about 32 questions). Each colored line represents a different participant (labeled, e.g., P100), with a total of 21 participants in the Q1 shown.
library(RColorBrewer) # For better color palettes
# Function to process a participant's data
process_participant_data <- function(data, participant_id) {
# Calculate time differences
times <- data %>%
filter(participant_id == !!participant_id) %>%
arrange(timestamp) %>%
pull(timestamp) %>%
as.POSIXct()
# Get total duration in minutes
total_duration <- as.numeric(difftime(max(times), min(times), units = "mins"))
# Only process if duration is <= 9.3 minutes (Quantile 1 value)
if (total_duration <= 9.3) {
# Calculate time differences in seconds
time_diffs <- diff(as.numeric(times))
# Create cumulative counts for each second threshold
max_diff <- ceiling(max(time_diffs))
breaks <- seq(1, max_diff)
counts <- sapply(breaks, function(x) sum(time_diffs <= x))
# Return data frame
return(data.frame(
seconds = breaks,
count = counts,
participant = participant_id
))
}
return(NULL)
}
# Process all participants
all_participants_data <- lapply(
unique(experimentResult$participant_id),
function(pid) process_participant_data(experimentResult, pid)
) %>%
bind_rows()
# Create the plot with distinct colors
ggplot(all_participants_data, aes(x = seconds, y = count, color = participant)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
scale_color_manual(
values = colorRampPalette(
c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00",
"#FFFF33", "#A65628", "#F781BF", "#999999", "#66C2A5",
"#FC8D62", "#8DA0CB", "#E78AC3", "#A6D854", "#FFD92F")
)(length(unique(all_participants_data$participant)))
) +
labs(
title = "Cumulative Distribution of Response Times by Participant",
x = "Time Threshold (seconds)",
y = "Number of Questions Answered Within Threshold",
color = "Participant ID"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
legend.position = "right",
legend.text = element_text(size = 10),
legend.title = element_text(size = 12)
)
Figure 4.5: Cumulative distribution of response times for participants in the first quantile of time spent on the experiment. Each line represents a different participant (n=21), showing the number of questions answered (y-axis) within each time threshold in seconds (x-axis).
The majority of participants took between 10–30 seconds per question. On the other hand, participants P96 (yellow line), P24 (dark gray line), and P94 (light green line) were notably faster, spending 5–10 seconds per question. Therefore, we will examine these unusual participants in more detail.
To determine whether the unusual participants’ reading speeds are reasonable, we need to know: (i) how many characters each item consists of, (ii) how quickly fast Japanese readers can read each item, and (iii) how quickly the unusual participants read each item. We will first address these questions with respect to P96 and P94, who were assigned List 1, while P24 was assigned List 2 (i.e., different items) in the experiment, as shown below.
# Get experiment versions for multiple participants
experiment_versions <- experimentResult %>%
filter(participant_id %in% c("P96", "P94", "P24")) %>%
group_by(participant_id) %>%
summarise(experiment_version = unique(experiment_version))
print(experiment_versions)
## # A tibble: 3 × 2
## participant_id experiment_version
## <chr> <chr>
## 1 P24 list1-split
## 2 P94 list2-splitReversed
## 3 P96 list2-splitReversed
The table below lists each item’s word count, how quickly fast Japanese readers can read the texts in the item, and how quickly P96 and P94 read them. I will elaborate on this table below.
| List2 Items | Context Characters | Dialogue Characters | Total Characters | Fast Reader’s Time (s) | P96’s Time - 1 (s) | P94’s Time - 1 (s) |
|---|---|---|---|---|---|---|
| P01 | 19 | 27 | 46 | 2.00 | N/A | N/A |
| P02 | 19 | 26 | 45 | 1.96 | 1.07 | 4.39 |
| P03 | 114 | 100 | 214 | 9.30 | 3.04 | 4.56 |
| F01 | 83 | 65 | 148 | 6.43 | 1.95 | 3.69 |
| acc.bad.05 | 22 | 63 | 85 | 3.70 | 1.67 | 4.88 |
| F13 | 32 | 72 | 104 | 4.52 | 4.58 | 4.08 |
| acc.ok.10 | 275 | 19 | 294 | 12.78 | 6.61 | 10.51 |
| F06 | 132 | 19 | 151 | 6.57 | 4.52 | 7.06 |
| acc.bad.07 | 150 | 43 | 193 | 8.39 | 5.53 | 5.05 |
| F20.high | 111 | 86 | 197 | 8.57 | 5.49 | 7.63 |
| F11 | 147 | 21 | 168 | 7.30 | 4.73 | 3.36 |
| F15 | 57 | 27 | 84 | 3.65 | 3.58 | 6.09 |
| F19.low | 156 | 14 | 170 | 7.39 | 3.88 | 3.48 |
| F17.low | 84 | 47 | 131 | 5.70 | 3.74 | 3.43 |
| F08 | 148 | 99 | 247 | 10.74 | 6.34 | 6.08 |
| acc.ok.04 | 139 | 21 | 160 | 6.96 | 2.90 | 2.25 |
| F07 | 82 | 63 | 145 | 6.30 | 4.12 | 4.37 |
| acc.bad.03 | 200 | 53 | 253 | 11.00 | 4.55 | 5.64 |
| F14 | 93 | 23 | 116 | 5.04 | 2.85 | 7.64 |
| acc.ok.02 | 202 | 79 | 281 | 12.17 | 4.60 | 5.69 |
| F12 | 134 | 30 | 164 | 7.13 | 3.00 | 4.24 |
| acc.bad.09 | 80 | 80 | 160 | 6.96 | 5.35 | 10.01 |
| F03 | 170 | 85 | 255 | 11.09 | 6.45 | 8.95 |
| F09 | 73 | 116 | 189 | 8.22 | 4.94 | 4.24 |
| acc.ok.08 | 102 | 54 | 156 | 6.78 | 3.93 | 2.34 |
| F02 | 29 | 40 | 69 | 3.00 | 2.84 | 3.48 |
| acc.bad.01 | 97 | 58 | 155 | 6.74 | 2.83 | 3.83 |
| F18.high | 212 | 18 | 230 | 10.00 | 3.55 | 2.78 |
| F04 | 121 | 25 | 146 | 6.35 | 3.55 | 6.19 |
| acc.ok.06 | 84 | 57 | 141 | 6.13 | 3.29 | 5.68 |
| F05 | 50 | 61 | 111 | 4.83 | 2.73 | 7.55 |
| F16 | 49 | 68 | 117 | 5.09 | 2.00 | 4.23 |
| F10 | 35 | 100 | 135 | 5.87 | 2.50 | 6.69 |
The first four columns list the item IDs and the number of characters in the texts for each item.
The fifth column, “Fast Reader’s Time (s)”, represents how many seconds it takes for fast Japanese readers to read the number of characters listed in the “Total Characters” column. This value is calculated as follows: The value in the “Total Characters” column is divided by 23, and the result is rounded to the second decimal place. The reason for dividing by 23 is as follows: In Kobayashi and Kawashima’s (2018) experiment, where participants were asked to read Japanese texts at a normal pace without skimming,2 the reading speed followed a unimodal distribution, ranging from approximately 300 to 1200 characters per minute, with a mean of 653 characters per minute, a median of 635 characters per minute, and a standard deviation of 174 characters per minute.3 Based on this finding, this paper assumes that participants P96, P94, and P24 can read 1400 characters per minute, or approximately 23 characters per second (≈ 1400/60). This assumption accounts for the possibility that these participants might read faster than those in Kobayashi and Kawashima’s (2018) experiment. This conservative assumption is made to avoid incorrectly concluding that these participants did not engage sincerely with the experiment and should be excluded from the statistical analysis.
The last two columns show how many seconds participants P96 and P94 are considered to have spent reading the texts in each item, excluding the first anchor item. First, I list below how many seconds these participants spent on each item, excluding the first anchor item.4
# Identify how long P96 spent for each item except the first anchor item
times_P96 <- experimentResult %>%
filter(participant_id == "P96") %>%
arrange(timestamp) %>%
pull(timestamp) %>%
as.POSIXct()
time_diffs_P96 <- diff(as.numeric(times_P96))
print(time_diffs_P96)
## [1] 2.072585 4.036435 2.952156 2.669867 5.583200 7.610973 5.523804 6.527529
## [9] 6.487053 5.730832 4.576248 4.881329 4.742683 7.337471 3.897416 5.116142
## [17] 5.545466 3.848570 5.598097 4.002965 6.348417 7.450092 5.935169 4.934051
## [25] 3.841895 3.831228 4.550519 4.554291 4.288207 3.733359 3.003003 3.504836
times_P94 <- experimentResult %>%
filter(participant_id == "P94") %>%
arrange(timestamp) %>%
pull(timestamp) %>%
as.POSIXct()
time_diffs_P94 <- diff(as.numeric(times_P94))
print(time_diffs_P94)
## [1] 5.390355 5.555790 4.686329 5.883100 5.077003 11.506866 8.061762
## [8] 6.053163 8.627218 4.355716 7.086284 4.479886 4.425748 7.077535
## [15] 3.246014 5.369903 6.641802 8.644943 6.693485 5.239444 11.007064
## [22] 9.947733 5.238376 3.341753 4.481994 4.834910 3.779109 7.185179
## [29] 6.680192 8.549585 5.232282 7.691876
If you compare these values with the values in the table in 4.1, you will notice that the latter is smaller than the former by 1. This is because even if I did not read the context or dialogues at all, it takes 1–2 seconds (and sometimes 3 seconds) just to select the rating and click the “next” button to move to the next question. In other words, conservatively assuming that it takes 1 second for these computer operations, I deducted 1 second from the time P96 and P94 spent on each item to estimate how many seconds they spent reading the texts in each item, excluding the first anchor item.
Based on this, observe the table in 4.1 again. The values in the columns “P96’s Time - 1 (s)” and “P94’s Time - 1 (s)” are highlighted if they are smaller than the value in “Fast Reader’s Time (s)” on the same row—that is, if those participants are considered to have read the texts too quickly. As you can see, 31 out of 32 items are highlighted for P96, and 27 out of 32 items are highlighted for P94. This suggests that these participants did not read the texts for most items carefully.
Next, we examine how quickly P24 read the texts, following the same approach as we did for P96 and P94 above. First, the following lists how many seconds P24 spent on each item, excluding the first anchor item.
times_P24 <- experimentResult %>%
filter(participant_id == "P24") %>%
arrange(timestamp) %>%
pull(timestamp) %>%
as.POSIXct()
time_diffs_P24 <- diff(as.numeric(times_P24))
print(time_diffs_P24)
## [1] 4.034482 11.141690 7.881734 6.275509 4.889306 6.645337 5.693556
## [8] 4.669329 4.183593 8.764511 4.223592 7.678386 6.373678 12.592983
## [15] 3.486389 5.385715 4.553883 7.316510 5.166387 2.743009 4.271933
## [22] 2.261056 6.932530 3.590514 4.984707 4.283391 3.534015 2.496334
## [29] 9.199031 6.537922 3.775638 3.755164
Based on this, I created the table in 4.2 below. As you can see, 21 out of 32 items are highlighted, meaning that those items were read too quickly. Therefore, P24 did not carefully read the texts for more than half of the items.
| List1 Items | Context Characters | Dialogue Characters | Total Characters | Fast Reader’s Time (s) | P24’s Time - 1 (s) |
|---|---|---|---|---|---|
| P01 | 19 | 27 | 46 | 2.00 | N/A |
| P02 | 19 | 26 | 45 | 1.96 | 4.03 |
| P03 | 114 | 100 | 214 | 9.30 | 11.14 |
| F18.low | 212 | 19 | 231 | 10.04 | 7.88 |
| F12 | 134 | 30 | 164 | 7.13 | 6.28 |
| F11 | 147 | 21 | 168 | 7.30 | 4.89 |
| F05 | 50 | 61 | 111 | 4.83 | 6.65 |
| F20.low | 111 | 40 | 151 | 6.57 | 5.69 |
| F04 | 121 | 25 | 146 | 6.35 | 4.67 |
| acc.ok.05 | 121 | 89 | 210 | 9.13 | 4.18 |
| F07 | 82 | 63 | 145 | 6.30 | 8.76 |
| F02 | 29 | 40 | 69 | 3.00 | 4.22 |
| F06 | 132 | 19 | 151 | 6.57 | 7.68 |
| acc.bad.02 | 171 | 49 | 220 | 9.57 | 6.37 |
| F10 | 35 | 100 | 135 | 5.87 | 12.59 |
| acc.ok.01 | 229 | 57 | 286 | 12.43 | 3.49 |
| F13 | 32 | 72 | 104 | 4.52 | 5.39 |
| acc.bad.10 | 223 | 22 | 245 | 10.65 | 4.55 |
| F03 | 170 | 85 | 255 | 11.09 | 7.32 |
| F16 | 49 | 68 | 117 | 5.09 | 5.17 |
| acc.ok.03 | 61 | 56 | 117 | 5.09 | 2.74 |
| F09 | 73 | 116 | 189 | 8.22 | 4.27 |
| acc.bad.04 | 182 | 18 | 200 | 8.70 | 2.26 |
| F01 | 83 | 65 | 148 | 6.43 | 6.93 |
| F17.high | 84 | 47 | 131 | 5.70 | 3.59 |
| acc.ok.07 | 106 | 67 | 173 | 7.52 | 4.98 |
| F15 | 57 | 27 | 84 | 3.65 | 4.28 |
| acc.bad.08 | 53 | 107 | 160 | 6.96 | 3.53 |
| F19.high | 156 | 12 | 168 | 7.30 | 2.50 |
| acc.ok.09 | 127 | 156 | 283 | 12.30 | 9.20 |
| F08 | 148 | 99 | 247 | 10.74 | 6.54 |
| F14 | 93 | 23 | 116 | 5.04 | 3.78 |
| acc.bad.06 | 158 | 69 | 227 | 9.87 | 3.76 |
To sum up, participants P96, P94, and P24 should be excluded from the statistical analysis in the next subsection, as these participants likely did not read the texts carefully.
Next, we examine the standard deviation of rating values for each participant (i.e., rating_sd values). The rating_sd values indicate how spread out or variable a participant’s ratings are around their own average rating. Participants with very low rating_sd values are flagged for further review, as this might indicate that the participant is not thoughtfully differentiating between items. First, consider the boxplot for rating standard deviations below.
#Calculate the standard deviation of ratings for each participant and sort the results in ascending order.
responseVariability <- experimentResult %>%
group_by(participant_id) %>%
summarise(
rating_sd = sd(rating, na.rm = TRUE),
.groups = "drop"
)%>%
arrange(rating_sd)
ggplot(responseVariability, aes(y = rating_sd)) +
geom_boxplot(width = 0.5, outlier.shape = NA) +
geom_point(aes(x = 0), position = position_dodge(width = 0.3)) +
theme_minimal() +
labs(title = "Distribution of Rating Standard Deviations",
y = "Standard Deviation of Ratings") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank())
Figure 4.6: Boxplot showing the distribution of standard deviations in participant ratings. The median standard deviation is approximately 2.3 (horizontal line), with most participants showing standard deviations between 2.1-2.5 (box edges representing the interquartile range). Several outliers with notably low variability (standard deviations below 1.5) are visible at the bottom of the plot
The distribution of rating standard deviations is centered around 2.2–2.3, indicating that participants showed reasonable differentiation in their ratings. However, there are a few notably low standard deviation values around 1.0–1.5, which fall well below the typical range. The relatively narrow interquartile range in the boxplot helps highlight how these low values deviate substantially from the normal pattern of response variability observed among other participants. These participants stand out as potential concerns, as their unusually low rating variability suggests they may not have been making meaningful distinctions between items in their ratings. This case should be flagged for further review to assess the quality and validity of their responses.
To examine participants whose standard deviation values deviate notably from others, I set the lower bound as Quantile 1 - 1.5 * interquartile range and the upper bound as Quantile 3 + 1.5 * interquartile range. I will further examine participants whose standard deviation values are lower than the lower bound or higher than the upper bound.
# Calculate quantiles and bounds
q1 <- quantile(responseVariability$rating_sd, 0.25)
q3 <- quantile(responseVariability$rating_sd, 0.75)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
# Identify "outliers".
outliers <- responseVariability %>%
mutate(is_outlier = rating_sd < lower_bound | rating_sd > upper_bound) %>%
filter(is_outlier)
# Print outlier information
if(nrow(outliers) > 0) {
print("Potential outliers found:")
for(i in 1:nrow(outliers)) {
print(paste("Participant", outliers$participant_id[i],
"with SD =", round(outliers$rating_sd[i], 3),
ifelse(outliers$rating_sd[i] < lower_bound,
"(below lower bound)", "(above upper bound)")))
}
} else {
print("\nNo outliers found")
}
## [1] "Potential outliers found:"
## [1] "Participant P4 with SD = 0.966 (below lower bound)"
## [1] "Participant P94 with SD = 1.458 (below lower bound)"
## [1] "Participant P24 with SD = 1.478 (below lower bound)"
It was already determined that Participant P94 and P24 will be excluded from our statistical analysis in the next sub-section as their participation duration in the experiment was too short, suggesting that they likely did not read the texts carefully. Therefore, our main focus below is whether or not we exclude P4. Also remember that the trend line for participant P26 only shows an upward slope, indicating an increase in the measured variable from condition 0 to condition 1, which is opposite from all the other partcipants. Therefore, we will look into this partcipant data as well in what follows.
First we will examine P4’s and P26’s ratings for the three anchor questions in comparison with the average ratings and ratings by P96, P94, and P24 (i.e., participants who were determined to be excluded our analysis in the next subsection).
# Define constants
unusualParticipants <- c("P4", "P26", "P96", "P94", "P24")
selected_items <- as.character(1:3)
# Function to create comparison dataset for a single participant
create_comparison_data <- function(participant_id, experiment_result) {
bind_rows(
experiment_result %>%
filter(participant_id == !!participant_id,
experiment_id == "Practice",
item_id %in% selected_items) %>%
mutate(comparison = paste("Average vs", participant_id)),
experiment_result %>%
filter(experiment_id == "Practice",
item_id %in% selected_items) %>%
group_by(item_id) %>%
summarise(rating = mean(rating, na.rm = TRUE)) %>%
mutate(participant_id = "Average",
comparison = paste("Average vs", !!participant_id))
)
}
# Create datasets for all participants
all_plot_data <- bind_rows(
map(unusualParticipants, ~create_comparison_data(.x, experimentResult))
)
# Reorder the facets by specifying the order of comparisons
all_plot_data <- all_plot_data %>%
mutate(comparison = factor(comparison,
levels = c("Average vs P4",
"Average vs P26",
"Average vs P24",
"Average vs P94",
"Average vs P96")))
# Create the faceted plot
ggplot(all_plot_data, aes(x = item_id, y = rating, group = participant_id, color = participant_id)) +
geom_point(size = 3) +
geom_line(linewidth = 1) +
facet_wrap(~comparison, ncol = 2) +
scale_color_manual(values = c("Average" = "black",
"P94" = "#4DAF4A",
"P96" = "#FF8C00",
"P24" = "#E41A1C",
"P26" = "#984EA3",
"P4" = "#377EB8")) +
theme_minimal() +
theme(
strip.text = element_text(size = 14, face = "bold"),
panel.spacing = unit(3, "lines"),
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "bottom",
legend.text = element_text(size = 12),
plot.margin = margin(1, 1, 1, 1),
panel.grid.minor = element_blank(),
legend.margin = margin(t = 1)
) +
labs(title = "Ratings of Practice Items: Individual Participants vs Average",
x = "Anchor Sentence ID",
y = "Rating",
color = "Participant ID") +
guides(color = guide_legend(order = 1, nrow = 2))
Figure 4.7: Comparison of individual participant ratings (colored lines) versus average ratings (black line) for practice/anchor items. The average pattern shows a characteristic dip for anchor sentence 2, while participants P4, P26, P94, and P96 display notable deviations from this pattern.
P4’s and P26’s ratings for the three anchor sentences (“6, 7, 6” and “5, 7, 6”) deviate from the average ratings (very natural, very unnatural, and somewhat natural/unnatural). While P4 and P26 rated the first anchor sentence appropriately, they gave the same or higher ratings to the second and third anchor sentences, even though these sentences were rated as the lowest and second lowest on average among the three anchor sentences. This suggests that P4 and P26 may not have understood the task or the scale, or they may have been responding carelessly. P96’s and P94’s ratings of the anchor sentences also deviate from those of many other participants, further supporting the decision to exclude these participants (as far as anchor sentence ratings are concerned, P24’s ratings are consistent with those of many other participants).
unusualParticipants <- c("P96", "P94", "P24", "P26", "P4")
selected_items <- as.character(1:16)
create_comparison_data <- function(participant_id, experiment_result) {
bind_rows(
experiment_result %>%
filter(participant_id == !!participant_id,
experiment_id == "Filler",
item_id %in% selected_items) %>%
mutate(comparison = paste("Average vs", participant_id)),
experiment_result %>%
filter(experiment_id == "Filler",
item_id %in% selected_items) %>%
group_by(item_id) %>%
summarise(rating = mean(rating, na.rm = TRUE)) %>%
mutate(participant_id = "Average",
comparison = paste("Average vs", !!participant_id))
)
}
all_plot_data <- bind_rows(
map(unusualParticipants, ~create_comparison_data(.x, experimentResult))
)
all_plot_data <- all_plot_data %>%
mutate(comparison = factor(comparison,
levels = c("Average vs P4",
"Average vs P26",
"Average vs P24",
"Average vs P94",
"Average vs P96")))
ggplot(all_plot_data, aes(x = item_id, y = rating, group = participant_id, color = participant_id)) +
geom_point(size = 3) +
geom_line(linewidth = 1) +
facet_wrap(~comparison, ncol = 2) +
scale_color_manual(values = c("Average" = "black",
"P94" = "#4DAF4A",
"P96" = "#FF8C00",
"P24" = "#E41A1C",
"P26" = "#984EA3",
"P4" = "#377EB8")) +
theme_minimal() +
theme(
strip.text = element_text(size = 14, face = "bold"),
panel.spacing = unit(3, "lines"),
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "bottom",
legend.text = element_text(size = 12),
plot.margin = margin(1, 1, 1, 1), # Added more margin around the entire plot
panel.grid.minor = element_blank(), # Removed minor grid lines for cleaner look
legend.margin = margin(t = 1) # Added space above the legend
) +
labs(title = "Ratings of Selected Filler Items: Individual Participants vs Average",
x = "Item ID",
y = "Rating",
color = "Participant ID") +
guides(color = guide_legend(order = 1, nrow = 2)) # Legend in two rows for better space usage
Figure 4.8: Ratings of selected filler items comparing individual participants (colored lines) with average ratings (black line) across 16 items. Each panel shows a different participant’s rating pattern (P4, P26, P24, P94, and P96) versus the average, revealing varying degrees of consistency with the group norm.
Based on the plot showing ratings for selected filler items, the participants exhibit varying patterns of deviation from the average ratings. P4’s ratings notably diverge from the average pattern, with a consistent trend of higher ratings than the average. This is particularly evident in the first half of the items, where P4 assigns ratings of 4–6 compared to the average ratings of around 1.5–2.5. The deviation becomes less pronounced but still persists in the latter half of the sequence, where P4’s ratings generally remain 1–2 points above the average. Similarly, P26’s ratings show substantial divergence from the average pattern but with more extreme fluctuations. Their ratings oscillate dramatically between very high (6–7) and very low (1–2) values, creating a more erratic pattern compared to the relatively stable average ratings. These pronounced deviations in rating patterns, combined with the unexpected ratings for anchor sentences and the substantially low standard deviation of ratings for P4, lead to the exclusion of both participants P4 and P26 from the statistical analysis in the next section.
It should be noted that both P94 and P96 also show substantial deviations from the average pattern: P94 gives high ratings for filler 1, 2, and 4, while P96 gives low ratings for filler 11, 13, 14, and 15, as well as high ratings for filler 1, 2, 4, and 5. This finding further justifies the decision to exclude these participants from the statistical analysis in the next subsection.
In contrast to P4, P26, P94, and P96, P24’s ratings follow a similar upward trend to the average ratings across the filler items, suggesting that P24’s responses demonstrate reasonable sensitivity to the differences between filler items. However, while P24’s ratings for anchor and filler items are reasonable, it is worth noting again that 21 out of 32 items were rated unreasonably quickly by P24, and P24’s standard deviation values are also below the lower bound. All things considered, it is likely that while P24 read the sentences to be rated carefully, P24 did not read the descriptions of the contexts in which the sentences were uttered. This explains why P24’s ratings for anchor and filler items—whose naturalness is independent of context—are reasonable, while P24 completed the experiment unreasonably quickly. Since the ratings of main items are considered to be context-dependent (a hypothesis the experiment is testing), I exclude P24, who likely did not read the contexts carefully.
So far, we’ve identified unusual participants by first analyzing their response times and the standard deviation of their ratings. This was followed by a closer look at their ratings for anchor and filler items. This method helped us identify participants who rushed through the experiment without properly reading the texts, as well as those who failed to thoughtfully differentiate between items. Next, we’ll take an approach to determine which participants stand out in their ratings for anchor and filler items. This will help us identify individuals whose acceptability judgments frequently and significantly differ from those of the majority of Japanese speakers. To achieve this, we’ll first need to establish what constitutes unreasonable ratings for these items and determine how many such ratings indicate that a participant’s judgments are consistently at odds with the majority. The following plots show the average ratings of anchor and filler items with 95% confidence intervals.
# Filler items
filler_summary <- experimentResult %>%
filter(experiment_id == "Filler",
item_id %in% selected_items) %>%
group_by(item_id) %>%
summarise(
average_rating = mean(rating, na.rm = TRUE),
ci_95 = 1.96 * sd(rating, na.rm = TRUE) / sqrt(n())
) %>%
arrange(item_id)
# Practice items
practice_summary <- experimentResult %>%
filter(experiment_id == "Practice",
item_id %in% selected_items) %>%
group_by(item_id) %>%
summarise(
average_rating = mean(rating, na.rm = TRUE),
ci_95 = 1.96 * sd(rating, na.rm = TRUE) / sqrt(n())
) %>%
arrange(item_id)
# Plot for Filler items
ggplot(filler_summary, aes(x = item_id, y = average_rating)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = average_rating - ci_95,
ymax = average_rating + ci_95),
width = 0.2) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
) +
scale_x_continuous(breaks = 1:16) + # Set specific breaks for x-axis
labs(title = "Average Ratings for Fillers with 95% Confidence Intervals",
x = "Item ID",
y = "Average Rating")
Figure 4.9: Average ratings for 16 different filler items, with each item identified by a numeric ID (1-16) on the x-axis. The y-axis shows the average rating scale ranging from 1 to 7. Each item is represented by a point indicating its mean rating, with vertical error bars showing the 95% confidence intervals.
# Plot for Practice items
ggplot(practice_summary, aes(x = item_id, y = average_rating)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = average_rating - ci_95,
ymax = average_rating + ci_95),
width = 0.2) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
) +
scale_x_continuous(breaks = 1:3) + # Set specific breaks for x-axis
labs(title = "Average Ratings for Anchors with 95% Confidence Intervals",
x = "Item ID",
y = "Average Rating")
Figure 4.10: Average ratings for 3 anchor items with 95% confidence intervals. The y-axis shows the average rating scale ranging from 1 to 7. Item 1 received the highest rating (approximately 6), followed by item 3 (approximately 4), while item 2 received the lowest rating (approximately 2.5).
In the experiment, participants rated the naturalness of sentences on a scale of 1 (very unnatural) to 7 (very natural), with 4 as the threshold. That is, if a rating is above 4, the sentence is considered ‘rather natural,’ and if it is below 4, the sentence is considered ‘rather unnatural.’ Given this, to identify participants whose ratings for anchor and filler items were suspicious or unusual, I apply the following criteria based on the expected acceptability of each item. For anchor items, we flag ratings as unusual if they deviate significantly from the expected patterns:
For filler items, I use similar logic:
The code below implements these criteria and identifies participants who provided 7 or more unusual ratings. With a total of 19 anchor and filler items, a participant with 6 unusual ratings (i.e., 13 usual ratings) would have rated approximately 70% of the items as expected. Based on this, I set the threshold for unacceptable behavior at 7 or more unusual ratings. While this number might seem lenient, it accounts for the fact that ratings for some items can vary significantly with minor changes, such as deleting or replacing a case particle. As a result, it’s reasonable for participants to provide one or two unusual ratings. Furthermore, decisions about excluding participants should be conservative, not only to maintain statistical power but also for ethical reasons—researchers should avoid manipulating data unless it is clearly justified. For these reasons, I set the threshold at 7 or more unusual ratings.
As you can see below, the code identifies P13, P28, and P79 as participants with a large number of unusual ratings.
# Define unusual ratings for anchor and filler items
unusualRatings <- experimentResult %>%
mutate(
unusual = case_when(
# Anchor items (Practice 1, 2, 3)
(experiment_id == "Practice" & item_id == 1 & rating <= 2) ~ 1, # Anchor1: Clearly acceptable (expected high rating)
(experiment_id == "Practice" & item_id == 2 & rating >= 6) ~ 1, # Anchor2: Clearly unacceptable (expected low rating)
(experiment_id == "Practice" & item_id == 3 & (rating <= 1 | rating >= 7)) ~ 1, # Anchor3: Somewhat acceptable/unacceptable (expected mid-range rating)
# Filler items (Filler 1-16)
(experiment_id == "Filler" & item_id %in% c(1:8) & rating >= 6) ~ 1, # F1-F8: Clearly unacceptable (expected low rating)
(experiment_id == "Filler" & item_id %in% c(9, 13:16) & rating <= 2) ~ 1, # F9, F13-F16: Clearly acceptable (expected high rating)
(experiment_id == "Filler" & item_id %in% c(10:12) & (rating <= 1 | rating >= 7)) ~ 1, # F10-F12: Somewhat acceptable/unacceptable (expected mid-range rating)
# All other cases are not unusual
TRUE ~ 0
)
)
# Count unusual ratings per participant
unusualRatings_counts <- unusualRatings %>%
group_by(participant_id) %>%
summarise(
unusualRatings_total = sum(unusual), # Total number of unusual ratings
.groups = "drop"
)
# Identify participants with more than 7 unusual rating
participants_with_multiple_unusual_ratings <- unusualRatings_counts %>%
filter(unusualRatings_total >= 7) %>% # Keep only participants with more than one weird rating
arrange(participant_id) # Sort by participant_id (optional)
print(participants_with_multiple_unusual_ratings, n=Inf)
## # A tibble: 3 × 2
## participant_id unusualRatings_total
## <chr> <dbl>
## 1 P13 7
## 2 P28 8
## 3 P79 7
In what follows, we examine the actual ratings of the anchor and filler items in question by P13, P28, and P79, comparing them to the average ratings.
unusualParticipants2 <- c("P13", "P28", "P79")
anchorItems <- as.character(1:3) # Adjusted to match your original plot's item range
create_comparison_data <- function(participant_id, experiment_result) {
bind_rows(
experiment_result %>%
filter(participant_id == !!participant_id,
experiment_id == "Practice",
item_id %in% anchorItems) %>%
mutate(comparison = paste("Average vs", participant_id)),
experiment_result %>%
filter(experiment_id == "Practice",
item_id %in% anchorItems) %>%
group_by(item_id) %>%
summarise(rating = mean(rating, na.rm = TRUE)) %>%
mutate(participant_id = "Average",
comparison = paste("Average vs", !!participant_id))
)
}
all_plot_data <- bind_rows(
map(unusualParticipants2, ~create_comparison_data(.x, experimentResult))
)
all_plot_data <- all_plot_data %>%
mutate(comparison = factor(comparison,
levels = c("Average vs P13",
"Average vs P28",
"Average vs P79")))
ggplot(all_plot_data, aes(x = item_id, y = rating, group = participant_id, color = participant_id)) +
geom_point(size = 3) +
geom_line(linewidth = 1) +
facet_wrap(~comparison, ncol = 2) +
scale_color_manual(values = c("Average" = "black",
"P13" = "#4DAF4A",
"P28" = "#E41A1C",
"P79" = "#377EB8")) +
theme_minimal() +
theme(
strip.text = element_text(size = 14, face = "bold"),
panel.spacing = unit(3, "lines"),
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "bottom",
legend.text = element_text(size = 12),
plot.margin = margin(1, 1, 1, 1),
panel.grid.minor = element_blank(),
legend.margin = margin(t = 1)
) +
labs(title = "Ratings of Practice Items: Individual Participants vs Average",
x = "Anchor Sentence ID",
y = "Rating",
color = "Participant ID") +
guides(color = guide_legend(order = 1, nrow = 2))
Figure 4.11: Ratings of practice items comparing three individual participants (P13, P28, P79) with average ratings (black line) across three anchor sentences. While the average shows a characteristic dip for sentence 2, participants P13 and P28 rated all sentences consistently at 7, and P79 showed a different pattern with high ratings for sentences 1-2 and a sharp decline for sentence 3.
Looking at participants’ ratings for practice (anchor) items, several concerning patterns emerge that suggest potentially unreliable responses. Both P13 and P28 show completely flat lines at the maximum rating (7), indicating they failed to discriminate between these calibration items. This is in stark contrast to the average pattern, which shows systematic variation (starting high, dipping in the middle, and rising slightly at the end). P79 rated the second practice item as 7, deviating markedly from the expected rating, which is around 2. These responses to the practice items suggest that these participants may not have engaged meaningfully with the calibration phase of the experiment.
unusualParticipants <- c("P13", "P28", "P79")
selected_items <- as.character(1:16)
create_comparison_data <- function(participant_id, experiment_result) {
bind_rows(
experiment_result %>%
filter(participant_id == !!participant_id,
experiment_id == "Filler",
item_id %in% selected_items) %>%
mutate(comparison = paste("Average vs", participant_id)),
experiment_result %>%
filter(experiment_id == "Filler",
item_id %in% selected_items) %>%
group_by(item_id) %>%
summarise(rating = mean(rating, na.rm = TRUE)) %>%
mutate(participant_id = "Average",
comparison = paste("Average vs", !!participant_id))
)
}
all_plot_data <- bind_rows(
map(unusualParticipants, ~create_comparison_data(.x, experimentResult))
)
ggplot(all_plot_data, aes(x = item_id, y = rating, group = participant_id, color = participant_id)) +
geom_point(size = 3) +
geom_line(linewidth = 1) +
facet_wrap(~comparison, ncol = 2) +
scale_color_manual(values = c("Average" = "black",
"P13" = "#4DAF4A",
"P28" = "#E41A1C",
"P79" = "#377EB8")) +
theme_minimal() +
theme(
strip.text = element_text(size = 14, face = "bold"),
panel.spacing = unit(3, "lines"),
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "bottom",
legend.text = element_text(size = 12),
plot.margin = margin(1, 1, 1, 1), # Added more margin around the entire plot
panel.grid.minor = element_blank(), # Removed minor grid lines for cleaner look
legend.margin = margin(t = 1) # Added space above the legend
) +
labs(title = "Ratings of Selected Filler Items: Individual Participants vs Average",
x = "Item ID",
y = "Rating",
color = "Participant ID") +
guides(color = guide_legend(order = 1, nrow = 2)) # Legend in two rows for better space usage
Figure 4.12: Ratings of selected filler items comparing individual participants (P13, P28, P79) with average ratings (black line) across 16 items. Each panel shows distinct rating patterns: P13 exhibits high variability with many maximum ratings, P28 alternates between very low and very high ratings, and P79 consistently gives low ratings for most items except the final few.
The three participants showed distinct patterns that deviated from the average ratings in different ways. P13 (shown in green) consistently gave extreme ratings around 7 on multiple items, while the average ratings remained between 2–5. For many filler items, P13’s ratings were approximately 3–4 points higher than the average. P28 (shown in red) demonstrated a pattern of alternating between very low ratings (around 1) and high ratings (around 6–7), creating a stark contrast with the more moderate average ratings, which generally stayed within the 2–5 range. P79 (shown in blue) exhibited the opposite pattern from the other two participants, consistently providing very low ratings (around 1) for most filler items, while the average ratings showed more variation between 2–6. P79’s ratings were often 2–4 points lower than the average ratings. These systematic deviations from the average suggest that these participants may have been using different rating criteria or potentially not engaging with the rating task in the same way as other participants. For these reasons, I exclude P13, P28, and P79 as well.
In the rest of this section, I filter out participants P4, P13, P24, P26, P28, P79, P94, and P96, and examine the difference between condition 0 and condition 1 sentences.
experimentResult <- experimentResult %>%
filter(!(participant_id %in% c("P4", "P13", "P24", "P26", "P28", "P79", "P94", "P96")))
ggplot(experimentResult %>% filter(experiment_id == "AccCase"), aes(x = factor(condition), y = rating)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Rating",
color = "Condition",
title = "Ratings Across Conditions for AccCase Items"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.13: Ratings across conditions for AccCase items. The scatterplot shows higher ratings clustered around 5-6 for condition 0 (pink) and lower ratings concentrated around 1-2 for condition 1 (teal), with a clear downward trend shown by the regression line.
The scatter plot reveals a clear downward trend in ratings from condition 0 to condition 1 for AccCase items. In condition 0, ratings are predominantly clustered at higher values (around 5–6), while in condition 1, ratings are concentrated at lower values (around 1–2). The blue trend line visually illustrates this significant decline in ratings between the two conditions, suggesting a consistent and substantial decrease in participant responses across AccCase items.
The exact rating values for condition 0 and condition 1 sentences, along with their standard deviations, are provided below.
# Calculate summary statistics
summary_stats <- experimentResult %>%
filter(experiment_id == "AccCase") %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating, na.rm = TRUE),
standard_deviation = sd(rating, na.rm = TRUE)
)
print(summary_stats)
## # A tibble: 2 × 3
## condition average_rating standard_deviation
## <dbl> <dbl> <dbl>
## 1 0 5.72 1.57
## 2 1 2.14 1.47
The table provides numerical support for the visual trend. The average rating for condition 0 is 5.72, with a standard deviation of 1.57, indicating that ratings for condition 0 were high with moderate variability. In contrast, condition 1 shows a much lower average rating of 2.14, with a standard deviation of 1.47. This confirms the graphical representation, quantitatively demonstrating a marked difference in ratings between condition 0 and condition 1, with the spread of ratings remaining relatively consistent across both conditions. Note that the difference in ratings between condition 0 and condition 1 is 3.56, which is almost as large as the difference (i.e., 3.67) between the very natural and very unnatural anchor sentences.
Next, we turn to the difference between condition 0 and condition 1 for each AccCase item to examine variations among items due to the use of particular lexical items.
custom_labeller <- function(labels) {
labels$item_id <- paste0("item ", labels$item_id)
labels
}
ggplot(experimentResult %>% filter(experiment_id == "AccCase"), aes(x = factor(condition), y = rating)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) + # Trend line with confidence intervals
facet_wrap(~ item_id, labeller = custom_labeller) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Rating",
color = "Condition",
title = "Trend Lines and Raw Data by Condition for Each Item"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.14: Individual item ratings by condition across 10 items. Each panel displays raw data points and trend lines comparing condition 0 (pink, higher ratings) to condition 1 (teal, lower ratings), with consistent downward trends across all items.
The figure displays trend lines and raw data points for 10 different items across two conditions. Each item shows a consistent downward trend from condition 0 to condition 1, with ratings typically high (around 5–6) in condition 0 and lower (around 1–2) in condition 1. The teal trend lines uniformly indicate a decline in ratings, suggesting a systematic difference in participant responses across all items.
The exact rating values for condition 0 and condition 1 sentences for each AccCase item are provided below.
itemBy_averageRating_accCase <- experimentResult %>%
filter(experiment_id == "AccCase") %>%
group_by(item_id) %>%
summarise(
avgRating_condition0 = mean(rating[condition == 0], na.rm = TRUE),
avgRating_condition1 = mean(rating[condition == 1], na.rm = TRUE),
diff_condition_0_1 = avgRating_condition0 - avgRating_condition1 # Calculate the difference
)
print(itemBy_averageRating_accCase, n = Inf)
## # A tibble: 10 × 4
## item_id avgRating_condition0 avgRating_condition1 diff_condition_0_1
## <dbl> <dbl> <dbl> <dbl>
## 1 1 4.83 1.45 3.38
## 2 2 5.80 2.12 3.68
## 3 3 5.79 2.20 3.58
## 4 4 6.77 3.29 3.49
## 5 5 6.17 2.16 4.01
## 6 6 6.64 2.55 4.09
## 7 7 5.17 1.52 3.64
## 8 8 5.14 1.59 3.55
## 9 9 5.83 2.39 3.45
## 10 10 5.02 2.19 2.83
The table provides detailed numerical insights for each item. The average ratings in condition 0 range from 4.83 to 6.77, while the ratings in condition 1 range from 1.45 to 3.29. The “diff_condition_0_1” column shows the difference between conditions, with values between 2.83 and 4.09, consistently indicating a substantial difference in ratings between condition 0 and condition 1 across all items.
Lastly, we turn to the difference between condition 0 and condition 1 for each participant to examine variations among participants.
ggplot(
experimentResult %>%
filter(experiment_id == "AccCase") %>%
mutate(participant_id = fct_relevel(participant_id, mixedsort(unique(participant_id)))),
aes(x = factor(condition), y = rating)
) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
facet_wrap(~ participant_id, ncol = 8) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
#scale_y_continuous(limits = c(0, 7), breaks = seq(0, 7, 1)) +
scale_y_continuous(limits = c(-0.1, 7.1)) +
labs(
x = "Condition",
y = "Rating",
color = "Condition",
title = "Trend Lines and Raw Data by Condition for Each Participant"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5) # Ensure proper alignment for "0" and "1"
)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 58 rows containing missing values or values outside the scale range
## (`geom_point()`).
Figure 4.15: Individual participant ratings by condition. Each panel represents one participant’s ratings with trend lines comparing condition 0 (pink points) to condition 1 (teal points). All 86 participants show consistent downward trends between conditions with confidence intervals (gray shading).
The figure displays trend lines and raw data points for 86 different participants across two conditions. Every participant shows a consistent downward trend from condition 0 to condition 1, with ratings typically higher (around 5–6) in condition 0 and lower (around 1–2) in condition 1. The blue trend lines uniformly indicate a decline in ratings, suggesting a systematic difference in participant responses across all participants. Among these 86 participants, P47, P76, P92, and P91 reported having studied linguistics before. The first three of these participants show substantial differences between condition 0 and condition 1, while P91 shows a smaller difference with more variation in ratings across items. One difference I observe regarding their linguistics background is that P47, P76, and P92 seem to have studied a wide range of linguistics subfields, including syntax, while P91 reports studying only phonetics. Therefore, it is possible that participants who have studied syntax, semantics, and/or morphology are more likely to observe the difference between condition 0 and condition 1. However, I leave this hypothesis for future research, as the current study does not involve many participants with a linguistics background.
The exact rating values for condition 0 and condition 1 sentences from each participant are provided below.
participantBy_averageRating_accCase <- experimentResult %>%
filter(experiment_id == "AccCase") %>%
group_by(participant_id) %>%
summarise(
avgRating_condition0 = mean(rating[condition == 0], na.rm = TRUE),
avgRating_condition1 = mean(rating[condition == 1], na.rm = TRUE),
diff_condition_0_1 = avgRating_condition0 - avgRating_condition1
) %>%
arrange(desc(diff_condition_0_1))
print(participantBy_averageRating_accCase, n = Inf)
## # A tibble: 86 × 4
## participant_id avgRating_condition0 avgRating_condition1 diff_condition_0_1
## <chr> <dbl> <dbl> <dbl>
## 1 P50 6.8 1 5.8
## 2 P87 6.8 1 5.8
## 3 P15 7 1.4 5.6
## 4 P62 6.8 1.2 5.6
## 5 P80 6.6 1 5.6
## 6 P99 7 1.4 5.6
## 7 P18 7 1.6 5.4
## 8 P78 6.4 1 5.4
## 9 P92 6.4 1.2 5.2
## 10 P61 6.6 1.6 5
## 11 P65 6.6 1.6 5
## 12 P75 6.4 1.4 5
## 13 P76 6.8 1.8 5
## 14 P66 6.4 1.6 4.8
## 15 P69 6 1.2 4.8
## 16 P84 5.8 1 4.8
## 17 P36 6.4 1.8 4.6
## 18 P102 5.6 1 4.6
## 19 P57 5.6 1 4.6
## 20 P40 6.4 2 4.4
## 21 P67 5.4 1 4.4
## 22 P49 6.6 2.2 4.4
## 23 P73 5.6 1.2 4.4
## 24 P21 6.4 2.2 4.2
## 25 P71 5.2 1 4.2
## 26 P47 6.8 2.6 4.2
## 27 P81 5.6 1.4 4.2
## 28 P12 6.2 2.2 4
## 29 P20 6.2 2.2 4
## 30 P70 5 1 4
## 31 P74 6.8 2.8 4
## 32 P83 5.2 1.2 4
## 33 P68 5.6 1.6 4
## 34 P97 5.4 1.6 3.8
## 35 P7 4.8 1 3.8
## 36 P89 5.8 2 3.8
## 37 P52 5.4 1.75 3.65
## 38 P101 5.4 1.8 3.6
## 39 P48 5.4 1.8 3.6
## 40 P9 5.4 1.8 3.6
## 41 P60 5 1.4 3.6
## 42 P88 7 3.4 3.6
## 43 P17 4.6 1 3.6
## 44 P22 5.8 2.2 3.6
## 45 P33 4.6 1 3.6
## 46 P41 5.8 2.2 3.6
## 47 P43 4.6 1 3.6
## 48 P59 6.6 3 3.6
## 49 P25 5.4 2 3.4
## 50 P51 4.6 1.2 3.4
## 51 P23 6.4 3.2 3.2
## 52 P64 6.4 3.2 3.2
## 53 P82 7 3.8 3.2
## 54 P85 5.2 2 3.2
## 55 P95 6 2.8 3.2
## 56 P27 6.8 3.6 3.2
## 57 P55 4.6 1.4 3.2
## 58 P54 6.2 3.2 3
## 59 P58 5.8 2.8 3
## 60 P8 5.2 2.2 3
## 61 P56 4.6 1.6 3
## 62 P90 6.6 3.6 3
## 63 P72 4.2 1.4 2.8
## 64 P11 6.6 3.8 2.8
## 65 P35 5.8 3 2.8
## 66 P42 6 3.4 2.6
## 67 P93 6.2 3.6 2.6
## 68 P16 6.6 4 2.6
## 69 P2 4.8 2.2 2.6
## 70 P29 5.6 3 2.6
## 71 P53 5.8 3.2 2.6
## 72 P6 4.8 2.2 2.6
## 73 P10 6.4 4 2.4
## 74 P38 5.4 3 2.4
## 75 P86 4.8 2.4 2.4
## 76 P19 3.2 1 2.2
## 77 P1 4 2 2
## 78 P3 3.2 1.2 2
## 79 P77 5.4 3.6 1.8
## 80 P100 5.6 3.8 1.8
## 81 P46 6 4.2 1.8
## 82 P91 4.8 3 1.8
## 83 P14 5.4 3.8 1.6
## 84 P44 4.8 3.4 1.4
## 85 P34 4.6 3.4 1.2
## 86 P5 3.8 2.6 1.2
The table provides detailed numerical insights for each participant. The average ratings in condition 0 range from 3.2 (P3, P19) to 7.0 (P3, P19), while the ratings in condition 1 range from 1.0 (P50, P80, P87, etc.) to 4.2 (P46). The “diff_condition_0_1” column shows the differences between conditions, consistently indicating a substantial decrease in ratings from condition 0 to condition 1 for almost all participants. Most participants experience a rating drop between 3.2 and 5.8 points, demonstrating a remarkably uniform pattern of response differences across the dataset.
As mentioned in Section 3.1, 23 out of 86 participants seem to be able to manage basic conversation in a language other than Japanese (for simplicity, this paper refers to these participants as bilingual speakers/participants, in contrast to the others, who are referred to as monolingual speakers/participants). To visually explore whether foreign language knowledge affects participants’ grammaticality judgments, Figure 4.16 compares the distribution of ratings across conditions between monolingual and bilingual participants.
ggplot(experimentResult %>% filter(experiment_id == "AccCase"), aes(x = factor(condition), y = rating)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Rating",
color = "Condition",
title = "Ratings Across Conditions between Monolingual and Bilingual Participants"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
) +
facet_wrap(~ foreignLanguage, labeller = labeller(foreignLanguage = c("0" = "Participants who speak no foreign language", "1" = "Participants who speak a foreign language")))
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.16: Ratings across conditions for AccCase items by participants’ language background. The figure compares condition effects between monolingual Japanese speakers (left panel) and bilingual speakers of Japanese and a foreign language (right panel), showing nearly identical patterns of high ratings for condition 0 (pink) and low ratings for condition 1 (teal) regardless of language background.
The figure presents a two-panel scatter plot comparing how monolingual and bilingual Japanese speakers rate AccCase items across two conditions. Each panel displays individual rating data points on a scale from 0 to 7, with pink dots representing Condition 0 and teal dots representing Condition 1. Both participant groups show remarkably similar patterns, with a clear decline in ratings from Condition 0 to Condition 1 (as indicated by the blue trend lines with negative slopes). The distribution of ratings appears nearly identical between the two groups, with Condition 0 ratings clustering primarily between 4–7 and Condition 1 ratings clustering between 0–3. Despite the participants’ different language backgrounds, the slope and position of the trend lines are strikingly similar, suggesting that knowledge of a foreign language has minimal impact on participants’ judgments of case marker acceptability in Japanese.
To quantify any potential differences between monolingual and bilingual participants, we calculate summary statistics for each group’s ratings across both experimental conditions.
# Summary statistics for participants who speak Chinese, English, or Spanish in addition to Japanese
summary_stats_foreignLanguage <- experimentResult %>%
filter(experiment_id == "AccCase") %>% # Filter only the "AccCase" experiment
filter(foreignLanguage %in% c("1")) %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating, na.rm = TRUE),
standard_deviation = sd(rating, na.rm = TRUE),
n = n()
)
print(summary_stats_foreignLanguage)
## # A tibble: 2 × 4
## condition average_rating standard_deviation n
## <dbl> <dbl> <dbl> <int>
## 1 0 5.75 1.47 110
## 2 1 2.25 1.58 110
# Summary statistics for all other participants
summary_stats_noForeignLanguage <- experimentResult %>%
filter(experiment_id == "AccCase") %>% # Filter only the "AccCase" experiment
filter(foreignLanguage %in% c("0")) %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating, na.rm = TRUE),
standard_deviation = sd(rating, na.rm = TRUE),
n = n()
)
print(summary_stats_noForeignLanguage)
## # A tibble: 2 × 4
## condition average_rating standard_deviation n
## <dbl> <dbl> <dbl> <int>
## 1 0 5.71 1.60 319
## 2 1 2.10 1.43 319
For bilingual participants who speak Chinese, English, or Spanish in addition to Japanese (n=110), condition 0 sentences received an average rating of 5.75 (SD=1.47), while condition 1 sentences received a substantially lower average rating of 2.25 (SD=1.58). Similarly, for monolingual Japanese speakers (n=319), condition 0 sentences received an average rating of 5.71 (SD=1.60), while condition 1 sentences received an average rating of 2.10 (SD=1.43). These numerical results confirm the visual patterns observed in the figure, with virtually identical ratings in condition 0 (a difference of only 0.04 points) and very similar ratings in condition 1 (a difference of only 0.15 points). The standard deviations are also comparable between groups, indicating similar levels of response variability regardless of language background. These findings suggest that knowledge of foreign languages does not meaningfully influence participants’ judgments of case marker acceptability in Japanese (but see Section 4.2).
Next, we turn to the difference in ratings of main items between speakers of Eastern and Western dialects.
ggplot(experimentResult %>% filter(experiment_id == "AccCase"), aes(x = factor(condition), y = rating)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Rating",
color = "Condition",
title = "Ratings Across Conditions for AccCase Items by Dialect"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
) +
facet_wrap(~ dialect, labeller = labeller(dialect = c("east" = "East Dialect", "west" = "West Dialect")))
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.17: Ratings across conditions for AccCase items by dialect. The figure compares condition effects between Eastern and Western Japanese dialects, showing similar downward trends from condition 0 (pink) to condition 1 (teal) but with subtle differences in rating patterns based on dialect.
The figure presents a two-panel scatter plot comparing how speakers of Eastern and Western dialects rate AccCase items across two conditions. Each panel displays individual rating data points on a scale from approximately 0 to 7, with pink dots representing Condition 0 and teal dots representing Condition 1. While both dialect groups show a clear decline in ratings from Condition 0 to Condition 1 (as indicated by the blue trend lines with negative slopes), there are subtle differences between the groups. Western dialect speakers give slightly higher ratings for both condition 0 and condition 1 sentences compared to Eastern dialect speakers, although the two trend lines appear almost parallel, suggesting a similar condition effect and the lack of a condition x dialect interaction. Despite these dialectal differences, the overall pattern remains consistent—higher ratings in Condition 0 (mostly between 4–7) and lower ratings in Condition 1 (mostly between 0–3)—indicating that while dialect influences the ratings to some degree, the fundamental effect of condition persists across dialectal boundaries.
The average rating values for each group of speakers, along with their standard deviations, are provided below.
# Summary statistics for participants who speak Eastern Japanese dialects
summary_stats_eastern <- experimentResult %>%
filter(experiment_id == "AccCase") %>%
filter(dialect == "east") %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating, na.rm = TRUE),
standard_deviation = sd(rating, na.rm = TRUE),
n = n()
)
print(summary_stats_eastern)
## # A tibble: 2 × 4
## condition average_rating standard_deviation n
## <dbl> <dbl> <dbl> <int>
## 1 0 5.47 1.72 238
## 2 1 2.00 1.39 238
# Summary statistics for participants who speak Western Japanese dialects
summary_stats_western <- experimentResult %>%
filter(experiment_id == "AccCase") %>%
filter(dialect == "west") %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating, na.rm = TRUE),
standard_deviation = sd(rating, na.rm = TRUE),
n = n()
)
print(summary_stats_western)
## # A tibble: 2 × 4
## condition average_rating standard_deviation n
## <dbl> <dbl> <dbl> <int>
## 1 0 6.03 1.30 191
## 2 1 2.31 1.55 191
The tables provide summary statistics for the ratings of main items, separated by dialect group. For Eastern Japanese dialect speakers (n=238), condition 0 sentences received an average rating of 5.47 (SD=1.72), while condition 1 sentences received a substantially lower average rating of 2.00 (SD=1.39). Similarly, for Western Japanese dialect speakers (n=191), condition 0 sentences received an average rating of 6.03 (SD=1.30), while condition 1 sentences received an average rating of 2.31 (SD=1.55). These numerical results confirm the visual patterns observed in Figure 4.17, with Western dialect speakers giving slightly higher ratings in both conditions compared to Eastern dialect speakers (a difference of 0.56 points in condition 0 and 0.31 points in condition 1). The standard deviations indicate somewhat greater variability in Eastern speakers’ ratings for condition 0, while Western speakers show slightly more variability in condition 1. Despite these minor dialectal differences, both groups demonstrate a clear and substantial preference for condition 0 over condition 1, with mean differences of 3.47 points for Eastern speakers and 3.72 points for Western speakers.
Finally, to investigate potential experimental artifacts related to task duration, I will examine whether participants’ ratings for each condition systematically changed as they progressed through the experiment. Such changes could arise from factors like fatigue, increased familiarity with the rating scale, or evolving response strategies, potentially affecting the validity of our measurements. To conduct this investigation, I will add an order column to our dataframe experimentResult, representing the sequence in which each participant completed the experimental tasks.
# First ensure timestamps are properly formatted as datetime objects
experimentResult$timestamp <- as.POSIXct(experimentResult$timestamp)
# Create order column by participant
experimentResult <- experimentResult %>%
group_by(participant_id) %>%
arrange(timestamp, .by_group = TRUE) %>%
mutate(order = row_number()) %>%
ungroup()
head(experimentResult %>%
arrange(participant_id, experiment_id, item_id, order) %>%
select(participant_id, timestamp, order))
## # A tibble: 6 × 3
## participant_id timestamp order
## <chr> <dttm> <int>
## 1 P1 2024-12-13 02:08:15 31
## 2 P1 2024-12-13 01:34:34 29
## 3 P1 2024-12-13 01:31:36 4
## 4 P1 2024-12-13 01:32:09 8
## 5 P1 2024-12-13 01:34:11 25
## 6 P1 2024-12-13 01:33:30 18
The figure below shows how participant ratings evolved over the course of the experiment.
ggplot(experimentResult %>% filter(experiment_id == "AccCase"), aes(x = order, y = rating, color = factor(condition))) +
geom_point(
alpha = 0.5,
position = position_jitter(width = 0.2)
) +
geom_smooth(
method = "lm",
se = TRUE,
linewidth = 1
) +
scale_color_discrete(
labels = c("Condition 0", "Condition 1")
) +
labs(
x = "Presentation Order",
y = "Rating",
color = "Condition",
title = "Effect of Presentation Order on Ratings by Condition"
) +
theme_minimal() +
theme(
legend.position = "right",
axis.text = element_text(size = 10),
axis.title = element_text(size = 12)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.18: Effect of presentation order on ratings by condition. The plot shows consistent separation between condition 0 (pink points) and condition 1 (teal points) across all presentation orders. While condition 0 ratings remain relatively stable, condition 1 shows a slight positive trend as presentation order increases, suggesting a potential interaction between condition and order.
The figure displays a scatter plot examining the relationship between presentation order (x-axis) and participant ratings (y-axis) for both experimental conditions. Each data point represents an individual rating, with condition 0 shown in pink and condition 1 in teal. The plot reveals a clear and consistent separation between the two conditions across all presentation orders, with condition 0 ratings clustering around the higher end of the scale (approximately 5.5–6) and condition 1 ratings clustering around the lower end (approximately 2). Both conditions show slight upward trends as presentation order increases, indicated by the gently sloping trend lines with confidence intervals (shaded areas). This positive slope appears somewhat more pronounced for condition 1, suggesting that participants may have become slightly more lenient in their judgments of the dispreferred condition as the experiment progressed. Despite this subtle order effect, the substantial gap between conditions remains stable throughout the experiment, confirming that the observed condition effect is robust and not primarily attributable to task duration artifacts. The consistent separation between conditions across presentation orders provides evidence that participants maintained their sensitivity to the linguistic contrast being tested throughout the experimental session.
We first transform the rating values into their z-score counterparts to normalize the data and minimize scale biases. This transformation makes it easier to compare ratings across participants and items.
# Calculate z-scores within participants
experimentResult$rating_z <- ave(experimentResult$rating, experimentResult$participant_id, FUN = function(x) as.numeric(scale(x)))
experimentResult %>%
select(experiment_id, participant_id, item_id, condition, rating, rating_z) %>%
print()
## # A tibble: 2,838 × 6
## experiment_id participant_id item_id condition rating rating_z
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Practice P1 1 0 3 -0.133
## 2 Practice P1 2 0 1 -1.01
## 3 Practice P1 3 0 3 -0.133
## 4 AccCase P1 3 0 4 0.307
## 5 Filler P1 3 0 1 -1.01
## 6 Filler P1 16 0 7 1.63
## 7 Filler P1 9 0 7 1.63
## 8 AccCase P1 4 1 1 -1.01
## 9 Filler P1 1 0 7 1.63
## 10 Filler P1 17 0 5 0.747
## # ℹ 2,828 more rows
To verify that the z-score transformation was implemented correctly, we will calculate the mean and standard deviation of the z-transformed ratings for each participant. These values should theoretically approximate 0 and 1, respectively, if the standardization was performed properly.
# Calculate mean and standard deviation of z-scores by participant
zscore_stats <- experimentResult %>%
group_by(participant_id) %>%
summarise(
mean_rating_z = mean(rating_z, na.rm = TRUE),
sd_rating_z = sd(rating_z, na.rm = TRUE)
)
print(zscore_stats)
## # A tibble: 86 × 3
## participant_id mean_rating_z sd_rating_z
## <chr> <dbl> <dbl>
## 1 P1 2.69e-17 1
## 2 P10 -1.88e-16 1
## 3 P100 -1.61e-16 1
## 4 P101 -8.75e-17 1
## 5 P102 8.07e-17 1
## 6 P11 8.07e-17 1
## 7 P12 2.69e-17 1
## 8 P14 1.51e-16 1
## 9 P15 1.41e-16 1
## 10 P16 1.46e-16 1
## # ℹ 76 more rows
The results confirm that the z-score transformation was conducted correctly, with each participant’s standardized ratings showing a mean effectively equal to zero (with only negligible rounding error) and a standard deviation of exactly 1. This standardization successfully normalizes individual response tendencies across participants, enabling more reliable comparisons between experimental conditions by accounting for differences in how participants use the rating scale.
While we have already excluded some unusual participants in the previous section, we will conduct an additional check to identify potential outliers by examining extreme z-score values. This analysis counts instances where participants’ standardized ratings exceeded an absolute value of 3, representing highly unusual response patterns that fall more than 3 standard deviations from their mean ratings.
# Count extreme z-scores by participant
extreme_z_scores <- experimentResult %>%
group_by(participant_id) %>%
summarise(
extreme_count = sum(abs(rating_z) > 3, na.rm = TRUE)
) %>%
filter(extreme_count > 0)
if(nrow(extreme_z_scores) > 0) {
print(extreme_z_scores)
} else {
cat("No participants have extreme z-scores (> |3|).\n")
}
## No participants have extreme z-scores (> |3|).
The analysis indicates that no participants demonstrated extreme z-scores exceeding the |3| threshold, suggesting the absence of severe outliers in our dataset. This finding confirms that our prior participant exclusion criteria were effective, and the remaining data represents consistent response patterns without anomalous ratings that would require additional filtering or closer examination.
Next, we will filter the dataset to include only main items.
experimentResult_AccCase <- experimentResult %>%
filter(experiment_id == "AccCase")
To confirm that the filtering was conducted correctly, we check the number of rows in the dataset before and after the filtering.
# Number of rows in the original dataset
nrow_original <- nrow(experimentResult)
print(paste("Number of rows in original dataset:", nrow_original))
## [1] "Number of rows in original dataset: 2838"
# Number of rows in the filtered dataset
nrow_filtered <- nrow(experimentResult_AccCase)
print(paste("Number of rows in filtered dataset:", nrow_filtered))
## [1] "Number of rows in filtered dataset: 858"
There are 86 participants, and each participant rated 33 items (3 anchors, 10 main items, and 20 fillers). Therefore, the number of rows in the original dataset (2838) is correct. However, the number of rows in the filtered dataset (858) is unexpected, as it should be 860. Using the code below, we will identify participants with incomplete or inconsistent data by comparing the number of items they rated in each experiment type (AccCase, Filler, and Practice) to the expected counts.
# Step 1: Count items rated by each participant for each experiment_id
item_counts <- experimentResult %>%
group_by(participant_id, experiment_id) %>%
summarise(
count = n(), # Number of items rated
.groups = "drop"
)
# Step 2: Define expected counts and identify incorrect counts
expected_counts <- data.frame(
experiment_id = c("AccCase", "Filler", "Practice"),
expected_count = c(10, 20, 3)
)
item_counts <- item_counts %>%
left_join(expected_counts, by = "experiment_id")
incorrect_counts <- item_counts %>%
filter(count != expected_count)
# Step 3: Summarize the results
summary_incorrect_counts <- incorrect_counts %>%
group_by(participant_id) %>%
summarise(
missing_types = paste(experiment_id, collapse = ", "), # Types with incorrect counts
.groups = "drop"
)
print(summary_incorrect_counts)
## # A tibble: 2 × 2
## participant_id missing_types
## <chr> <chr>
## 1 P52 AccCase, Filler
## 2 P99 AccCase, Filler
The analysis identified two participants, P52 and P99, who had discrepancies in the number of items they rated for the AccCase and Filler experiment types compared to the expected counts. To investigate further, the next step filters the dataset to focus on these two participants and calculates the actual number of items they rated for each experiment type (AccCase, Filler, and Practice). This will help confirm the specific missing or extra items and ensure the accuracy of the data before proceeding with further analysis.
experimentResult %>%
filter(participant_id %in% c("P52", "P99")) %>%
group_by(participant_id, experiment_id) %>%
summarise(actual_count = n(), .groups = "drop")
## # A tibble: 6 × 3
## participant_id experiment_id actual_count
## <chr> <chr> <int>
## 1 P52 AccCase 9
## 2 P52 Filler 21
## 3 P52 Practice 3
## 4 P99 AccCase 9
## 5 P99 Filler 21
## 6 P99 Practice 3
The results show that participants P52 and P99 had discrepancies in their item counts for the AccCase and Filler experiment types. Specifically, both participants rated 9 items for AccCase (1 fewer than expected) and 21 items for Filler (1 more than expected), while their counts for Practice were correct. To further investigate these discrepancies, the next step filters the dataset to focus on these two participants and retrieves their detailed responses, including item_id, condition, and rating, sorted by participant_id and experiment_id. This will allow us to examine the specific items they rated and identify any patterns or errors in their responses.
experimentResult %>%
filter(participant_id %in% c("P52", "P99")) %>%
select(participant_id, experiment_id, item_id, condition, rating) %>%
arrange(participant_id, experiment_id)%>%
print(n=Inf)
## # A tibble: 66 × 5
## participant_id experiment_id item_id condition rating
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 P52 AccCase 6 1 2
## 2 P52 AccCase 9 0 7
## 3 P52 AccCase 7 0 7
## 4 P52 AccCase 4 1 3
## 5 P52 AccCase 3 0 5
## 6 P52 AccCase 10 1 1
## 7 P52 AccCase 1 0 1
## 8 P52 AccCase 2 1 1
## 9 P52 AccCase 5 0 7
## 10 P52 Filler 14 0 6
## 11 P52 Filler 8 0 3
## 12 P52 Filler 19 0 7
## 13 P52 Filler 19 0 7
## 14 P52 Filler 15 0 7
## 15 P52 Filler 17 0 7
## 16 P52 Filler 1 0 2
## 17 P52 Filler 9 0 3
## 18 P52 Filler 16 0 7
## 19 P52 Filler 3 0 1
## 20 P52 Filler 13 0 7
## 21 P52 Filler 10 0 6
## 22 P52 Filler 6 0 1
## 23 P52 Filler 2 0 1
## 24 P52 Filler 7 0 3
## 25 P52 Filler 4 0 3
## 26 P52 Filler 20 1 1
## 27 P52 Filler 5 0 4
## 28 P52 Filler 11 0 4
## 29 P52 Filler 12 0 4
## 30 P52 Filler 18 1 5
## 31 P52 Practice 1 0 7
## 32 P52 Practice 2 0 2
## 33 P52 Practice 3 0 3
## 34 P99 AccCase 5 1 1
## 35 P99 AccCase 7 1 2
## 36 P99 AccCase 4 0 7
## 37 P99 AccCase 3 1 1
## 38 P99 AccCase 2 0 7
## 39 P99 AccCase 9 1 2
## 40 P99 AccCase 8 0 7
## 41 P99 AccCase 1 1 1
## 42 P99 AccCase 6 0 7
## 43 P99 Filler 1 0 1
## 44 P99 Filler 13 0 7
## 45 P99 Filler 13 0 7
## 46 P99 Filler 6 0 1
## 47 P99 Filler 20 0 7
## 48 P99 Filler 11 0 6
## 49 P99 Filler 15 0 6
## 50 P99 Filler 19 1 7
## 51 P99 Filler 17 1 4
## 52 P99 Filler 8 0 1
## 53 P99 Filler 7 0 7
## 54 P99 Filler 14 0 6
## 55 P99 Filler 12 0 2
## 56 P99 Filler 3 0 7
## 57 P99 Filler 9 0 2
## 58 P99 Filler 2 0 2
## 59 P99 Filler 18 0 7
## 60 P99 Filler 4 0 4
## 61 P99 Filler 5 0 1
## 62 P99 Filler 16 0 3
## 63 P99 Filler 10 0 6
## 64 P99 Practice 1 0 6
## 65 P99 Practice 1 0 6
## 66 P99 Practice 3 0 2
The detailed results show that participants P52 and P99 rated certain items twice (e.g., Filler item 19 for P52 and Filler items 13 and Practice item 1 for P99) and missed one AccCase item each (item 8 for P52 and item 10 for P99). These errors are likely due to collection issues caused by the Heroku app (the platform used to run the experiment) rather than flaws in the experimental design. If the issue were design-related, more participants assigned the same item sets would have exhibited similar patterns. While the lack of two ratings for main items is unfortunate, it is unlikely to significantly impact the statistical analysis. Additionally, these participants were not flagged in the earlier step where unusual participants were identified for exclusion. Therefore, these errors do not warrant excluding these participants from the analysis, and the filtering was conducted appropriately.
Now that we have filtered the dataset to include only main items, we will first examine plots showing the difference between condition 0 and condition 1 across the dataset experimentResult_AccCase, for each item and for each participant. We will begin by examining the difference between the two conditions across the dataset:
ggplot(experimentResult_AccCase, aes(x = factor(condition), y = rating_z)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Z-Transformed Rating",
color = "Condition",
title = "Z-Transformed Ratings Across Conditions for AccCase Items"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.19: Z-transformed ratings for AccCase items by condition. Condition 0 (pink) shows predominantly positive z-scores (clustered around 0.5-1), while condition 1 (teal) shows predominantly negative z-scores (around -1), with a clear downward trend indicated by the regression line.
# Calculate summary statistics
summary_stats_z <- experimentResult_AccCase %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating_z, na.rm = TRUE),
standard_deviation = sd(rating_z, na.rm = TRUE)
)
print(summary_stats_z)
## # A tibble: 2 × 3
## condition average_rating standard_deviation
## <dbl> <dbl> <dbl>
## 1 0 0.741 0.614
## 2 1 -0.779 0.552
The z-transformed ratings plot reinforces the pattern observed in the raw ratings while controlling for individual differences in how participants used the rating scale. After standardization, condition 0 shows a mean z-score of 0.741 (SD = 0.614), while condition 1 shows a mean z-score of -0.779 (SD = 0.552). The difference of 1.52 standard deviations between conditions represents the effect size after accounting for participant-specific rating tendencies (such as some participants consistently using higher or lower parts of the scale). This standardized difference corroborates the substantial effect we observed in the raw ratings (3.56 points), while providing additional confidence that the effect is robust even when controlling for individual rating patterns. The similar standard deviations in the z-transformed data (0.614 vs. 0.552) further confirm our earlier observation about consistent variability across conditions.
Next, I examine the same by-item patterns using z-transformed ratings to control for individual differences in how participants used the rating scale. This transformation allows us to verify whether the effects observed in the raw ratings remain consistent when accounting for participant-specific rating tendencies.
# Custom labeller to add "item" before the item number
custom_labeller <- function(labels) {
labels$item_id <- paste0("item ", labels$item_id)
labels
}
ggplot(experimentResult_AccCase, aes(x = factor(condition), y = rating_z)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) + # Trend line with confidence intervals
facet_wrap(~ item_id, labeller = custom_labeller) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Z-Transformed Rating",
color = "Condition",
title = "Z-Transformed Ratings Across Condition for Each Item"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.20: Z-transformed ratings by condition for each of the 10 items. All items show consistent patterns with positive z-scores in condition 0 (pink) and negative z-scores in condition 1 (teal), with similar downward trends across all panels.
The z-transformed ratings by item provide additional confidence in our findings by controlling for individual rating patterns while revealing the same consistent effect across all items. After standardization, each item shows ratings clustered around positive z-scores (approximately 0.5 to 1.5) in condition 0 and negative z-scores (approximately -0.5 to -1.5) in condition 1. The uniformity of this pattern in the standardized data is particularly noteworthy—all 10 items show remarkably similar effect sizes in their z-transformed ratings, with comparable spreads of data points around the trend lines. However, there are notable item-specific variations in both the baseline ratings (intercepts) at condition 0 and the magnitude of change between conditions (slopes), suggesting the need to include by-item random intercepts and random slopes for condition in our statistical model. This systematic item-level variation, combined with the consistency in the overall directional pattern, suggests that the effect we observed in the raw ratings is not driven by how individual participants used the rating scale but rather reflects a genuine and robust difference between conditions that holds across all items while allowing for item-specific adjustments.
To quantify the standardized effects for each item, let’s examine the exact z-transformed means and differences between conditions.
# Calculate summary statistics for each AccCase items.
itemBy_averageRating_accCase_z <- experimentResult_AccCase %>%
group_by(item_id) %>%
summarise(
avgRating_condition0 = mean(rating_z[condition == 0], na.rm = TRUE),
avgRating_condition1 = mean(rating_z[condition == 1], na.rm = TRUE),
diff_condition_0_1 = avgRating_condition0 - avgRating_condition1
)
print(itemBy_averageRating_accCase_z, n = Inf)
## # A tibble: 10 × 4
## item_id avgRating_condition0 avgRating_condition1 diff_condition_0_1
## <dbl> <dbl> <dbl> <dbl>
## 1 1 0.370 -1.09 1.46
## 2 2 0.752 -0.794 1.55
## 3 3 0.787 -0.738 1.53
## 4 4 1.18 -0.279 1.46
## 5 5 0.946 -0.778 1.72
## 6 6 1.12 -0.599 1.72
## 7 7 0.528 -1.06 1.59
## 8 8 0.466 -1.02 1.48
## 9 9 0.807 -0.658 1.46
## 10 10 0.441 -0.756 1.20
The z-transformed ratings provide a standardized measure of the effect size for each item. In condition 0, all items show positive z-scores ranging from 0.37 to 1.18, while condition 1 consistently shows negative z-scores ranging from -0.28 to -1.09. The differences between conditions in standardized units are remarkably consistent across items, ranging from 1.20 to 1.72 standard deviations. This consistency in effect sizes is particularly noteworthy because it shows that the magnitude of the difference between conditions remains stable even after controlling for individual rating patterns. Items 5 and 6 show the largest standardized differences (both 1.72 standard deviations), while item 10 shows a somewhat smaller but still substantial effect (1.20 standard deviations). These standardized differences provide a scale-independent measure of the effect that complements our earlier observations about the raw rating differences.
We now turn to individual participant responses using z-transformed ratings to assess the consistency of the effect while controlling for individual rating scales.
ggplot(
experimentResult_AccCase %>%
mutate(participant_id = fct_relevel(participant_id, mixedsort(unique(participant_id)))),
aes(x = factor(condition), y = rating_z)
) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
facet_wrap(~ participant_id, ncol = 8) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
scale_y_continuous(limits = c(-2.5, 2.5)) +
labs(
x = "Condition",
y = "Z-Transformed Rating",
color = "Condition",
title = "Z-Transformed Ratings Across Condition for Each Participant"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.21: Z-transformed ratings by condition for individual participants. Each panel shows one participant’s standardized ratings comparing condition 0 (pink points) to condition 1 (teal points). All 86 participants show consistent downward trends between conditions, with confidence intervals shown in gray shading, indicating a robust effect across participants despite individual variation in baseline ratings and effect magnitudes.
The z-transformed ratings by participant provide compelling evidence that the effect remains robust even after accounting for individual rating tendencies. In the standardized scale, nearly all participants show positive z-scores (typically between 0.5 and 1.5) for condition 0 and negative z-scores (typically between -0.5 and -1.5) for condition 1. While the general pattern is consistent, there are noticeable variations in both baseline ratings at condition 0 and the magnitude of change between conditions across participants, indicating the need for by-participant random intercepts and random slopes for condition in our statistical model. The presence of this systematic participant-level variation, combined with the overall consistency in directional effects, suggests that the difference between conditions is not driven by a subset of participants with extreme rating patterns. This uniformity in the z-transformed data strengthens our earlier observations by demonstrating that the effect persists even when each participant’s ratings are standardized relative to their own rating scale usage.
To quantify the standardized effects for each participant, let’s examine the exact z-transformed means and differences between participants.
participantBy_averageRating_accCase_z <- experimentResult_AccCase %>%
group_by(participant_id) %>%
summarise(
avgRating_condition0 = mean(rating_z[condition == 0], na.rm = TRUE),
avgRating_condition1 = mean(rating_z[condition == 1], na.rm = TRUE),
diff_condition_0_1 = avgRating_condition0 - avgRating_condition1
) %>%
arrange(desc(diff_condition_0_1))
print(participantBy_averageRating_accCase_z, n = Inf)
## # A tibble: 86 × 4
## participant_id avgRating_condition0 avgRating_condition1 diff_condition_0_1
## <chr> <dbl> <dbl> <dbl>
## 1 P50 1.40 -0.919 2.31
## 2 P99 1.05 -1.15 2.20
## 3 P62 1.10 -1.03 2.12
## 4 P18 1.20 -0.900 2.10
## 5 P15 1.10 -0.993 2.09
## 6 P87 0.843 -1.22 2.06
## 7 P61 1.08 -0.950 2.03
## 8 P80 1.17 -0.857 2.03
## 9 P65 0.891 -1.10 1.99
## 10 P49 1.04 -0.950 1.99
## 11 P76 0.912 -1.06 1.97
## 12 P59 0.989 -0.975 1.96
## 13 P92 0.940 -1.02 1.96
## 14 P78 0.799 -1.15 1.94
## 15 P73 1.27 -0.674 1.94
## 16 P69 0.981 -0.938 1.92
## 17 P83 0.828 -1.07 1.90
## 18 P66 0.908 -0.980 1.89
## 19 P97 0.769 -1.11 1.88
## 20 P68 1.10 -0.787 1.88
## 21 P71 1.09 -0.781 1.87
## 22 P67 0.828 -0.988 1.82
## 23 P40 0.765 -1.05 1.81
## 24 P88 0.813 -0.976 1.79
## 25 P47 0.925 -0.846 1.77
## 26 P75 0.704 -1.05 1.75
## 27 P84 0.834 -0.918 1.75
## 28 P36 0.879 -0.872 1.75
## 29 P102 0.743 -0.997 1.74
## 30 P7 0.725 -1.01 1.73
## 31 P57 0.909 -0.799 1.71
## 32 P74 0.774 -0.917 1.69
## 33 P54 0.568 -1.11 1.67
## 34 P22 0.804 -0.860 1.66
## 35 P41 0.913 -0.745 1.66
## 36 P33 0.810 -0.826 1.64
## 37 P9 0.782 -0.842 1.62
## 38 P101 0.698 -0.921 1.62
## 39 P21 0.865 -0.751 1.62
## 40 P43 0.871 -0.701 1.57
## 41 P60 0.605 -0.957 1.56
## 42 P89 0.901 -0.659 1.56
## 43 P81 0.498 -1.04 1.54
## 44 P70 0.571 -0.968 1.54
## 45 P52 0.551 -0.985 1.54
## 46 P16 0.780 -0.748 1.53
## 47 P23 0.723 -0.798 1.52
## 48 P55 0.651 -0.850 1.50
## 49 P20 0.785 -0.704 1.49
## 50 P82 0.787 -0.697 1.48
## 51 P12 0.959 -0.520 1.48
## 52 P51 0.701 -0.761 1.46
## 53 P85 0.727 -0.732 1.46
## 54 P90 1.13 -0.326 1.45
## 55 P93 0.568 -0.882 1.45
## 56 P25 0.646 -0.798 1.44
## 57 P17 0.408 -1.03 1.43
## 58 P95 0.733 -0.675 1.41
## 59 P27 0.764 -0.642 1.41
## 60 P64 0.778 -0.587 1.36
## 61 P48 0.645 -0.719 1.36
## 62 P53 0.750 -0.527 1.28
## 63 P56 0.565 -0.712 1.28
## 64 P6 0.401 -0.855 1.26
## 65 P11 0.672 -0.550 1.22
## 66 P2 0.745 -0.475 1.22
## 67 P86 0.698 -0.493 1.19
## 68 P58 0.689 -0.499 1.19
## 69 P72 0.136 -1.05 1.18
## 70 P35 0.607 -0.506 1.11
## 71 P42 0.788 -0.320 1.11
## 72 P8 0.464 -0.641 1.10
## 73 P29 0.793 -0.308 1.10
## 74 P10 0.864 -0.236 1.10
## 75 P100 0.542 -0.538 1.08
## 76 P19 0.171 -0.816 0.987
## 77 P38 0.435 -0.492 0.927
## 78 P3 0.176 -0.745 0.921
## 79 P14 0.0711 -0.823 0.894
## 80 P1 0.307 -0.574 0.881
## 81 P77 0.323 -0.542 0.865
## 82 P46 0.719 -0.0719 0.791
## 83 P91 0.440 -0.342 0.782
## 84 P34 0.436 -0.290 0.726
## 85 P44 0.325 -0.316 0.641
## 86 P5 0.116 -0.431 0.547
The z-transformation reveals patterns that weren’t immediately apparent in the raw ratings. While the raw scores showed varying absolute differences, the z-scores demonstrate that participants like P50, P99, and P62 exhibited the most extreme standardized differences (>2 standard deviations) between conditions. This standardization also highlights that even participants with different raw rating ranges showed similar relative shifts—for instance, P73 and P78 have nearly identical standardized differences (1.94) despite having quite different raw rating patterns. The z-scores range from approximately -1.22 to 1.40 across conditions, with most participants showing between 1.2 and 2.0 standard deviations of difference between conditions. This standardized view suggests that while participants varied in their use of the rating scale, the relative magnitude of the effect was remarkably consistent across participants when controlling for individual rating tendencies.
Next, we will examine the effect of participants’ foreign language knowledge on their rating_z values. While Section 4.1 suggested that such an effect is minimal when analyzing raw rating values, we might find a different result when analyzing the standardized rating values.
ggplot(experimentResult_AccCase, aes(x = factor(condition), y = rating_z)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Z-Transformed Rating",
color = "Condition",
title = "Rating_z Across Conditions between Monolingual and Bilingual Participants"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
) +
facet_wrap(~ foreignLanguage, labeller = labeller(foreignLanguage = c("0" = "Participants who speak no foreign language", "1" = "Participants who speak a foreign language")))
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.22: Ratings across conditions for AccCase items by participants’ language background. The figure compares condition effects between monolingual Japanese speakers (left panel) and bilingual speakers of Japanese and a foreign language (right panel), showing nearly identical patterns of high ratings for condition 0 (pink) and low ratings for condition 1 (teal) regardless of language background.
The figure presents a two-panel scatter plot comparing z-transformed ratings across conditions between monolingual and bilingual participants. Each panel displays individual z-transformed rating data points on a scale from approximately -2 to 2, with pink dots representing Condition 0 and teal dots representing Condition 1. Both participant groups show similar negative slopes from condition 0 to condition 1 as indicated by the blue trend lines. However, upon closer examination, there appears to be a slightly more pronounced separation between conditions for bilingual participants compared to monolingual participants. For bilingual participants, condition 0 ratings cluster slightly higher on the z-scale while their condition 1 ratings appear somewhat less negative compared to monolingual participants. This subtle difference suggests that standardization may reveal patterns not apparent in the raw ratings in figure 4.16.
To quantify these potential differences in standardized ratings between language groups, we calculate summary statistics for z-transformed ratings across both experimental conditions.
# Summary statistics for participants who speak Chinese, English, or Spanish in addition to Japanese
summary_stats_foreignLanguage_z <- experimentResult_AccCase %>%
filter(foreignLanguage %in% c("1")) %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating_z, na.rm = TRUE),
standard_deviation = sd(rating_z, na.rm = TRUE),
n = n()
)
print(summary_stats_foreignLanguage_z)
## # A tibble: 2 × 4
## condition average_rating standard_deviation n
## <dbl> <dbl> <dbl> <int>
## 1 0 0.791 0.596 110
## 2 1 -0.716 0.586 110
# Summary statistics for all other participants
summary_stats_noForeignLanguage_z <- experimentResult_AccCase %>%
filter(foreignLanguage %in% c("0")) %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating_z, na.rm = TRUE),
standard_deviation = sd(rating_z, na.rm = TRUE),
n = n()
)
print(summary_stats_noForeignLanguage_z)
## # A tibble: 2 × 4
## condition average_rating standard_deviation n
## <dbl> <dbl> <dbl> <int>
## 1 0 0.724 0.620 319
## 2 1 -0.800 0.540 319
The tables provide detailed summary statistics for z-transformed ratings, separated by participants’ language background. For bilingual participants (n=110), condition 0 sentences received an average z-score of 0.791 (SD=0.596), while condition 1 sentences received a substantially lower average z-score of -0.716 (SD=0.586). For monolingual Japanese speakers (n=319), condition 0 sentences received an average z-score of 0.724 (SD=0.620), while condition 1 sentences received an average z-score of -0.800 (SD=0.540). Unlike the raw ratings, these z-transformed values appear to reveal slightly larger differences between the participant groups. The separation between conditions (the difference between condition 0 and condition 1 z-scores) is 1.507 for bilingual participants but 1.524 for monolingual participants, suggesting monolingual participants make a somewhat sharper distinction between grammatical and ungrammatical sentences. Additionally, bilingual participants show slightly higher z-scores for condition 1 (-0.716 vs. -0.800), suggesting they may be marginally more accepting of grammatically questionable constructions. While these differences remain small, the standardization process has made the effect of language background more apparent than was visible in the raw rating analysis. Therefore, it is worth examining the statistical significance of the fixed effect of foreignLanguage and the fixed condition x foreignLanguage.
The predictor foreignLanguage is a between-participant variable, and thus we do not need to consider by-participant random slopes for foreignLanguage. However, it is a within-item variable, and thus we should address if our model should include by-item random slopes for foreignLanguage. Therefore, we will examine the relationship between condition and z-transformed ratings across different items for both monolingual and bilingual participants to determine if by-item random slopes for foreignLanguage and **condition*foreignLanguage** should be included in our model.
custom_labeller <- as_labeller(function(value) {
return(paste("Item", value))
})
ggplot(experimentResult_AccCase, aes(x = factor(condition), y = rating_z)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Z-Transformed Rating",
color = "Condition",
title = "Rating_z by Condition, Item, and Language Background"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5),
strip.text = element_text(size = 8),
panel.spacing = unit(0.8, "lines")
) +
facet_grid(
item_id ~ foreignLanguage,
labeller = labeller(
foreignLanguage = c("0" = "Monolingual", "1" = "Bilingual"),
item_id = custom_labeller
)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.23: Z-transformed ratings by condition, item, and foreignLanguage. Plots show ratings for monolingual (left panels) and bilingual (right panels) participants across 10 items. The two language groups show varying intercepts and slopes across items.
The figure displays z-transformed ratings across 10 items with data points separated by condition (0/1) and Language Background (monolingual/bilingual). The varying intercepts at condition 0 between language groups across items support including by-item random slopes for foreignLanguage. Similarly, the differing magnitudes of condition effects between language groups across items justify including by-item random slopes for the **condition*foreignLanguage** interaction.
To further investigate the justification for including the aforementioned by-item random slopes in our model, we can examine the numerical differences between conditions and language groups across items. The table below provides the mean z-transformed ratings for each condition (0 and 1) across all items, separated by language background (monolingual and bilingual), along with the difference between conditions.
# Calculate summary statistics for each AccCase item, split by foreignLanguage
byItem_averageRating_accCase_z <- experimentResult_AccCase %>%
group_by(item_id, foreignLanguage) %>%
summarise(
rating_condition0 = mean(rating_z[condition == 0], na.rm = TRUE),
rating_condition1 = mean(rating_z[condition == 1], na.rm = TRUE),
difference = rating_condition0 - rating_condition1
) %>%
mutate(foreignLanguage = ifelse(foreignLanguage == 0, "Monolingual", "Bilingual")) %>%
arrange(item_id, foreignLanguage)
## `summarise()` has grouped output by 'item_id'. You can override using the
## `.groups` argument.
print(byItem_averageRating_accCase_z, n = Inf, width = Inf)
## # A tibble: 20 × 5
## # Groups: item_id [10]
## item_id foreignLanguage rating_condition0 rating_condition1 difference
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Bilingual 0.479 -1.02 1.50
## 2 1 Monolingual 0.340 -1.12 1.46
## 3 2 Bilingual 0.738 -0.744 1.48
## 4 2 Monolingual 0.758 -0.807 1.57
## 5 3 Bilingual 0.835 -0.685 1.52
## 6 3 Monolingual 0.774 -0.761 1.53
## 7 4 Bilingual 1.33 -0.130 1.46
## 8 4 Monolingual 1.11 -0.319 1.43
## 9 5 Bilingual 0.992 -0.592 1.58
## 10 5 Monolingual 0.934 -0.856 1.79
## 11 6 Bilingual 1.14 -0.377 1.52
## 12 6 Monolingual 1.10 -0.660 1.76
## 13 7 Bilingual 0.548 -1.08 1.63
## 14 7 Monolingual 0.523 -1.05 1.57
## 15 8 Bilingual 0.541 -0.868 1.41
## 16 8 Monolingual 0.435 -1.06 1.49
## 17 9 Bilingual 0.990 -0.773 1.76
## 18 9 Monolingual 0.757 -0.609 1.37
## 19 10 Bilingual 0.279 -0.635 0.914
## 20 10 Monolingual 0.511 -0.789 1.30
The table reveals notable variation in both baseline ratings (condition 0) and the magnitude of condition effects across items and language backgrounds. For example, the condition 0 ratings for bilingual participants range from 0.279 (item 10) to 1.33 (item 4), while for monolingual participants they range from 0.340 (item 1) to 1.11 (item 4). More importantly, the difference between conditions varies across items and language backgrounds, with differences ranging from 0.914 (item 10, bilingual) to 1.79 (item 5, monolingual). These numerical differences support our visual assessment that by-item random slopes for both foreignLanguage and the **condition*foreignLanguage** interaction are justified in our initial model.
Next, we will examine the effect of participants’ dialect on their values of rating_z.
ggplot(experimentResult_AccCase, aes(x = factor(condition), y = rating_z)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Z-Transformed Rating",
color = "Condition",
title = "Z-Transformed Ratings Across Conditions for AccCase Items by Dialect"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5)
) +
facet_wrap(~ dialect, labeller = labeller(dialect = c("east" = "East Dialect", "west" = "West Dialect")))
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.24: Ratings across conditions for AccCase items by dialect. The figure compares condition effects between Eastern and Western Japanese dialects, showing similar downward trends from condition 0 (pink) to condition 1 (teal) but with subtle dialectal differences in rating patterns.
Figure 4.24 shows z-transformed ratings across conditions for AccCase items by dialect. The scatter plot displays data points for Eastern and Western Japanese dialects side by side, with condition 0 (shown in pink) and condition 1 (shown in teal) on the x-axis and z-transformed ratings on the y-axis. Both dialects exhibit similar downward trends from condition 0 to condition 1, with most condition 0 ratings clustering between 0 and 1.5 on the z-scale, and condition 1 ratings clustering between -1.5 and 0. The blue trend lines confirm this negative relationship between conditions in both dialects, though there appear to be subtle dialectal differences in the distribution patterns.
The following tables provide the statistical summaries that quantify the observed differences between conditions across dialects.
# Summary statistics for participants who speak Eastern Japan dialects
summary_stats_eastern_z <- experimentResult_AccCase %>%
filter(dialect == "east") %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating_z, na.rm = TRUE),
standard_deviation = sd(rating_z, na.rm = TRUE),
n = n()
)
print(summary_stats_eastern_z)
## # A tibble: 2 × 4
## condition average_rating standard_deviation n
## <dbl> <dbl> <dbl> <int>
## 1 0 0.695 0.696 238
## 2 1 -0.793 0.514 238
# Summary statistics for participants who speak Western Japan dialects
summary_stats_western_z <- experimentResult_AccCase %>%
filter(dialect == "west") %>%
group_by(condition) %>%
summarise(
average_rating = mean(rating_z, na.rm = TRUE),
standard_deviation = sd(rating_z, na.rm = TRUE),
n = n()
)
print(summary_stats_western_z)
## # A tibble: 2 × 4
## condition average_rating standard_deviation n
## <dbl> <dbl> <dbl> <int>
## 1 0 0.798 0.489 191
## 2 1 -0.760 0.597 191
The tables present summary statistics for AccCase items across the two conditions (0 and 1) for both Eastern and Western Japanese dialects. For condition 0, Eastern dialect speakers (n=238) show an average z-transformed rating of 0.695 (SD=0.696), while Western dialect speakers (n=191) show a slightly higher average rating of 0.798 (SD=0.489). This difference in baseline ratings between dialects directly supports the inclusion of dialect as a fixed effect in our initial model. Furthermore, the difference in how ratings change between conditions varies by dialect: Eastern speakers show a larger decrease (-1.488) from condition 0 to condition 1 compared to Western speakers’ decrease (-1.558). This difference in effect size between dialects supports including the conditiondialect interaction in our initial statistical model. These numerical differences align with the subtle dialectal variations observed in figure 4.24 and strengthen our approach to modeling the data with both dialect as a fixed effect and the conditiondialect interaction.
To examine the variability in condition effects across individual test items and dialects, I present visualizations that will inform our random effects structure, particularly regarding by-item random slopes for dialect and the **condition*dialect** interaction.
custom_labeller <- as_labeller(function(value) {
return(paste("Item", value))
})
ggplot(experimentResult_AccCase, aes(x = factor(condition), y = rating_z)) +
geom_point(
aes(color = factor(condition)),
position = position_jitter(width = 0.2),
alpha = 0.5
) +
geom_smooth(
aes(group = 1),
method = "lm",
se = TRUE,
linewidth = 1,
color = "blue"
) +
scale_x_discrete(limits = c("0", "1"), labels = c("0", "1")) +
labs(
x = "Condition",
y = "Z-Transformed Rating",
color = "Condition",
title = "Rating_z by Condition, Item, and Dialect"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5),
strip.text = element_text(size = 8),
panel.spacing = unit(0.8, "lines")
) +
facet_grid(
item_id ~ dialect,
labeller = labeller(
dialect = c("east" = "East Dialect", "west" = "West Dialect"),
item_id = custom_labeller
)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.25: Z-transformed ratings by condition, item, and dialect. Each row presents data for one test item, comparing Eastern dialect speakers (left panels) and Western dialect speakers (right panels). The two dialect groups show varying intercepts and slopes across items.
The figure displays z-transformed ratings by condition (0 in pink, 1 in teal) across all 10 test items, with separate panels for Eastern and Western Japanese dialects. The blue regression lines show varying degrees of downward trend from condition 0 to condition 1. The steepness and positioning of these lines differ noticeably across items, suggesting item-specific differences in dialectal responses. For example, item 4 shows a stronger condition effect for Eastern dialect speakers compared to Western dialect speakers, while item 8 displays the opposite pattern.
To quantify the item-specific variability observed in the visualizations, we can examine the numerical differences between conditions and dialect groups across items.
# Calculate summary statistics for each AccCase item, split by dialect
byItem_averageRating_accCase_z <- experimentResult_AccCase %>%
group_by(item_id, dialect) %>%
summarise(
rating_condition0 = mean(rating_z[condition == 0], na.rm = TRUE),
rating_condition1 = mean(rating_z[condition == 1], na.rm = TRUE),
difference = rating_condition0 - rating_condition1
) %>%
mutate(dialect = ifelse(dialect == "east", "Eastern", "Western")) %>%
arrange(item_id, dialect)
## `summarise()` has grouped output by 'item_id'. You can override using the
## `.groups` argument.
print(byItem_averageRating_accCase_z, n = Inf, width = Inf)
## # A tibble: 20 × 5
## # Groups: item_id [10]
## item_id dialect rating_condition0 rating_condition1 difference
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Eastern 0.234 -1.07 1.30
## 2 1 Western 0.550 -1.11 1.66
## 3 2 Eastern 0.823 -0.833 1.66
## 4 2 Western 0.674 -0.741 1.42
## 5 3 Eastern 0.703 -0.612 1.31
## 6 3 Western 0.898 -0.890 1.79
## 7 4 Eastern 1.26 -0.464 1.73
## 8 4 Western 1.08 -0.0317 1.11
## 9 5 Eastern 0.951 -0.809 1.76
## 10 5 Western 0.939 -0.741 1.68
## 11 6 Eastern 1.09 -0.639 1.73
## 12 6 Western 1.15 -0.547 1.69
## 13 7 Eastern 0.423 -1.04 1.46
## 14 7 Western 0.669 -1.08 1.75
## 15 8 Eastern 0.303 -0.937 1.24
## 16 8 Western 0.662 -1.12 1.78
## 17 9 Eastern 0.817 -0.627 1.44
## 18 9 Western 0.792 -0.695 1.49
## 19 10 Eastern 0.336 -0.920 1.26
## 20 10 Western 0.562 -0.537 1.10
The table breaks down average z-transformed ratings for all 10 test items across both dialects in both conditions, with calculated differences between conditions. The data reveals considerable item-specific variation in dialectal effects. Item 1 shows a dialect difference of 0.316 in condition 0 (Eastern: 0.234, Western: 0.550) and varying effect sizes (Eastern: 1.30, Western: 1.66). Item 4 demonstrates even greater variation with Eastern speakers showing a 1.73 difference compared to Western speakers’ 1.11. This variability across items suggests that dialect influences both baseline ratings and responses to experimental manipulation, justifying the inclusion of by-item random slopes for dialect and **condition*dialect** interaction.
Next, to investigate potential experimental artifacts related to task duration, I will examine whether participants’ ratings for each condition systematically changed as they progressed through the experiment. Such changes could arise from factors like fatigue, increased familiarity with the rating scale, or evolving response strategies, potentially affecting the validity of our measurements.
ggplot(experimentResult_AccCase, aes(x = order, y = rating_z, color = factor(condition))) +
geom_point(
alpha = 0.5,
position = position_jitter(width = 0.2)
) +
geom_smooth(
method = "lm",
se = TRUE,
linewidth = 1
) +
scale_color_discrete(
labels = c("Condition 0", "Condition 1")
) +
labs(
x = "Presentation Order",
y = "Z-Transformed Rating",
color = "Condition",
title = "Effect of Presentation Order on Z-Transformed Ratings by Condition"
) +
theme_minimal() +
theme(
legend.position = "right",
axis.text = element_text(size = 10),
axis.title = element_text(size = 12)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.26: Effect of presentation order on z-transformed ratings by condition. The plot shows consistent separation between condition 0 (pink points, positive z-scores) and condition 1 (teal points, negative z-scores) across all presentation orders. While condition 0 ratings remain relatively stable, condition 1 shows a slight positive trend as presentation order increases.
The visualization reveals several noteworthy patterns in how participants’ ratings evolved over the course of the experiment. While condition 0 shows relatively stable ratings across presentation order (with a slight negative slope), condition 1 exhibits a modest positive trend as participants progress through the trials. This divergence in rating trajectories between conditions suggests a potential interaction between experimental condition and presentation order. Given these systematic patterns, our statistical model should include both a fixed effect for order and a fixed slope for the condition × order interaction to capture these temporal trends and their differential effects across conditions. This will allow us to quantify both the general effect of presentation order and how this effect differs between conditions, providing a more complete understanding of how participants’ responses evolved throughout the experiment.
To examine whether the effect of presentation order varies across individual items, I will visualize the relationship between order and ratings separately for each item. This item-level analysis is crucial as different stimuli might show distinct patterns of rating changes over time, potentially requiring the inclusion of by-item random slopes for order and its interaction with condition in our statistical models.
custom_labeller <- function(labels){
labels$item_id <- paste0("item", labels$item_id)
labels
}
ggplot(experimentResult_AccCase, aes(x = order, y = rating_z, color = factor(condition))) +
geom_point(
alpha = 0.5,
position = position_jitter(width = 0.2)
) +
geom_smooth(
method = "lm",
se = TRUE,
linewidth = 1
) +
facet_wrap(~ item_id, labeller = custom_labeller) +
scale_color_discrete(
labels = c("Condition 0", "Condition 1")
) +
labs(
x = "Presentation Order",
y = "Z-Transformed Rating",
color = "Condition",
title = "Effect of Presentation Order on Ratings by Item"
) +
theme_minimal() +
theme(
legend.position = "right",
axis.text = element_text(size = 8),
axis.title = element_text(size = 10),
strip.text = element_text(size = 9)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.27: Effect of presentation order on z-transformed ratings by item. Each panel represents one of ten items, showing how ratings in condition 0 (pink) and condition 1 (teal) change across presentation order. Items display varying temporal trajectories, with some showing opposite slopes between conditions and others exhibiting more parallel trends, indicating substantial item-level heterogeneity in order effects.
The visualization reveals considerable item-level variability in how ratings change across presentation order. Two key patterns require attention in our model: (1) items show different trajectories for the effect of order in the baseline condition 0 (e.g., some items show declining trends while others remain relatively flat), and (2) the difference between conditions varies across items (e.g., items 5, 9, 10 show consistent positive slopes for condition 1 and negative slopes for condition 0, while item 7 shows the opposite). This heterogeneity necessitates the inclusion of by-item random slopes for both order and the condition × order interaction in our statistical model. The random slope for order will capture how the temporal effect varies across items specifically for condition 0 (our baseline condition), while the random slope for the condition × order interaction will model how the difference in order effects between conditions changes differently for each item. Including these random effects is crucial for accurately modeling the complex patterns we observe, where items not only show different baseline responses but also exhibit distinct temporal trajectories both within the baseline condition and in how the conditions differ.
Next, to explore individual differences in how participants’ ratings change throughout the experiment, I will examine the relationship between presentation order and ratings for each participant separately. This participant-level analysis is essential for determining whether to include by-participant random slopes for order and its interaction with condition in our statistical models, as participants might show different patterns of rating changes over time due to varying levels of fatigue or adaptation to the task.
ggplot(experimentResult_AccCase, aes(x = order, y = rating_z, color = factor(condition))) +
geom_point(
alpha = 0.5,
position = position_jitter(width = 0.2)
) +
geom_smooth(
method = "lm",
se = TRUE,
linewidth = 1
) +
facet_wrap(~ participant_id, ncol = 8) +
scale_color_discrete(
labels = c("Condition 0", "Condition 1")
) +
scale_y_continuous(limits = c(-2.5, 2.5)) +
labs(
x = "Presentation Order",
y = "Z-Transformed Rating",
color = "Condition",
title = "Effect of Presentation Order on Ratings by Participant"
) +
theme_minimal() +
theme(
legend.position = "right",
axis.text = element_text(size = 6),
axis.title = element_text(size = 8),
strip.text = element_text(size = 6)
)
## `geom_smooth()` using formula = 'y ~ x'
Figure 4.28: Effect of presentation order on z-transformed ratings by participant. Small multiples showing individual participant trends across presentation order for condition 0 (pink) and condition 1 (teal). Despite consistent separation between conditions, participants show variable responses to presentation order, though wide confidence intervals (gray shading) indicate high uncertainty in these individual estimates due to limited observations per participant.
The visualization reveals substantial variability in the order effects across participants. However, the high degree of uncertainty in these individual estimates (as shown by the wide confidence bands) suggests that these trends may not be reliable given the limited number of observations per participant. Given this high uncertainty, including by-participant random slopes for order and its interaction with condition would likely lead to an overparameterized model. Therefore, a simpler random effects structure without those slopes would be more appropriate for our statistical analysis.
To examine the difference in z-transformed ratings between condition 0 and condition 1, we will fit a linear mixed-effects model using the z-transformed ratings (rating_z) as a function of condition. Based on the above observations, our initial model also includes other fixed effects and random effects, and we will gradually simplify the model as we analyze it. More specifically, our initial model also includes the following fixed effects:
Our model does not include interactions such as foreignLanguage x dialect, foreignLanguage x order, and dialect x order to begin with, as these interactions do not seem to be theoretically justified. For example, it is not very sensible to assume that the effect of order differs based on the values of foreignLanguage or dialect.
As for random effects, our initial model includes the following by-item random effects, which correspond to the fixed effects above.
Our initial model includes only the following two by-participant random effects:
There are no by-participant random slopes for order or interactions with order due to the small sample size and the high degree of uncertainty, as shown in Figure 4.28. On the other hand, foreignLanguage and dialect are between-participant predictors, so it is not sensible to include random slopes for foreignLanguage, dialect, or interactions with them.
To determine the most appropriate model specification, we will begin with a complex model (model_a) that includes the aforementioned fixed and random effects as well as interactions, and progressively reduce the complexity as needed.
# Load necessary package
library(lmerTest)
model_a <- lmer(rating_z ~ condition + foreignLanguage + dialect + order + condition*dialect + condition*foreignLanguage + condition*order +
(1 + condition + foreignLanguage + dialect + order + condition*foreignLanguage + condition*dialect + condition*order | item_id) +
(1 + condition | participant_id),
data = experimentResult_AccCase)
## boundary (singular) fit: see help('isSingular')
summary(model_a)$varcor
## Groups Name Std.Dev. Corr
## participant_id (Intercept) 0.1206809
## condition 0.1869696 -1.000
## item_id (Intercept) 0.3280487
## condition 0.1740252 -0.909
## foreignLanguage 0.0670372 0.482 -0.264
## dialectwest 0.1613766 -0.920 0.822 -0.209
## order 0.0063906 -0.118 0.389 -0.170 -0.194
## condition:foreignLanguage 0.0731200 0.009 -0.089 -0.820 -0.273
## condition:dialectwest 0.3038810 0.812 -0.793 -0.100 -0.935
## condition:order 0.0088149 0.131 -0.479 -0.027 0.097
## Residual 0.5117613
##
##
##
##
##
##
##
##
## 0.271
## 0.129 0.572
## -0.954 -0.109 0.006
##
Model_a encountered boundary (singular) fit warnings, indicating potential overparameterization or computational challenges in estimating the random effects. This suggests that the initial model specifications were too complex for the data structure. More specifically, the summary of model_a shows that condition has a perfect negative correlation (-1.000) with the intercept, meaning that the random slope for condition by participant_id is not supported by the data. Therefore, we remove by-participant random condition:
model_b <- lmer(rating_z ~ condition + foreignLanguage + dialect + order + condition*dialect + condition*foreignLanguage + condition*order +
(1 + condition + foreignLanguage + dialect + order + condition*foreignLanguage + condition*dialect + condition*order | item_id) +
(1 | participant_id),
data = experimentResult_AccCase)
## boundary (singular) fit: see help('isSingular')
summary(model_b)$varcor
## Groups Name Std.Dev. Corr
## participant_id (Intercept) 0.0000000
## item_id (Intercept) 0.3173631
## condition 0.1604655 -0.895
## foreignLanguage 0.0663234 0.503 -0.203
## dialectwest 0.1578244 -0.917 0.856 -0.201
## order 0.0057475 -0.037 0.254 -0.201 -0.264
## condition:foreignLanguage 0.0718693 0.035 -0.200 -0.774 -0.290
## condition:dialectwest 0.3004262 0.811 -0.851 -0.083 -0.928
## condition:order 0.0078612 0.030 -0.385 -0.110 0.141
## Residual 0.5208953
##
##
##
##
##
##
##
## 0.270
## 0.168 0.593
## -0.914 -0.027 0.011
##
Model_b still encountered boundary fit warnings. The summary of the model shows that by-participant random intercept is estimated as zero variance and is not contributing meaningfully to the model. Therefore, we remove by-participant random intercept.
model_c <- lmer(rating_z ~ condition + foreignLanguage + dialect + order + condition*dialect + condition*foreignLanguage + condition*order +
(1 + condition + foreignLanguage + dialect + order + condition*foreignLanguage + condition*dialect + condition*order | item_id),
data = experimentResult_AccCase)
## boundary (singular) fit: see help('isSingular')
summary(model_c)$varcor
## Groups Name Std.Dev. Corr
## item_id (Intercept) 0.3158208
## condition 0.1570119 -0.892
## foreignLanguage 0.0655028 0.515 -0.243
## dialectwest 0.1580066 -0.917 0.827 -0.211
## order 0.0058779 -0.027 0.294 -0.208 -0.280
## condition:foreignLanguage 0.0702496 0.021 -0.168 -0.774 -0.281
## condition:dialectwest 0.3000382 0.813 -0.826 -0.067 -0.929
## condition:order 0.0080915 0.010 -0.393 -0.082 0.183
## Residual 0.5209213
##
##
##
##
##
##
## 0.259
## 0.180 0.584
## -0.929 -0.033 -0.030
##
Model_c still encountered boundary fit warnings. The summary of the model shows that by-item random slopes for order and condition:order are estimated as nearly zero variance and are not contributing meaningfully to the model. Therefore, we remove those random effects next.
model_d <- lmer(rating_z ~ condition + foreignLanguage + dialect + order + condition*dialect + condition*foreignLanguage + condition*order +
(1 + condition + foreignLanguage + dialect + condition*foreignLanguage + condition*dialect | item_id),
data = experimentResult_AccCase)
## boundary (singular) fit: see help('isSingular')
summary(model_d)$varcor
## Groups Name Std.Dev. Corr
## item_id (Intercept) 0.330093
## condition 0.158455 -0.998
## foreignLanguage 0.056806 0.569 -0.617
## dialectwest 0.151483 -0.983 0.970 -0.407
## condition:foreignLanguage 0.060111 0.128 -0.070 -0.742 -0.311
## condition:dialectwest 0.293776 0.826 -0.792 0.008 -0.917 0.665
## Residual 0.522956
Model_d still encountered boundary fit warnings. The summary of the model shows that by-item random slopes for condition and directwest have very high correlations. Therefore, we remove those random slopes as well as random interactions including those slopes.
model_e <- lmer(rating_z ~ condition + foreignLanguage + dialect + order + condition*dialect + condition*foreignLanguage + condition*order +
(1 + foreignLanguage | item_id),
data = experimentResult_AccCase)
## boundary (singular) fit: see help('isSingular')
summary(model_e)$varcor
## Groups Name Std.Dev. Corr
## item_id (Intercept) 0.237380
## foreignLanguage 0.043759 1.000
## Residual 0.531168
Model_e still encountered boundary fit warnings. The summary of the model shows that by-item random random foreignLanguage has a perfect negative correlation (-1.000) with the intercept. Therefore, we remove the random slope.
model_f <- lmer(rating_z ~ condition + foreignLanguage + dialect + order + condition*dialect + condition*foreignLanguage + condition*order +
(1 | item_id),
data = experimentResult_AccCase)
summary(model_f)$varcor
## Groups Name Std.Dev.
## item_id (Intercept) 0.24855
## Residual 0.53149
Model_f resolved the singularity warnings. This model includes random intercepts for item_id, allowing for variation across items while maintaining a parsimonious model structure. The simplification process helps prevent overfitting and ensures more stable and interpretable statistical estimates. The fixed effects remain consistent across models, with condition + foreignLanguage + dialect + order + conditiondialect + conditionforeignLanguage + **condition*order**. The model specification balances statistical rigor with computational feasibility, providing a reliable framework for analyzing the z-transformed ratings.
Now that we successfully resolved the model’s singularity issues, we examine the fixed effects of model_f.
summary(model_f)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## rating_z ~ condition + foreignLanguage + dialect + order + condition *
## dialect + condition * foreignLanguage + condition * order +
## (1 | item_id)
## Data: experimentResult_AccCase
##
## REML criterion at convergence: 1415.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.8806 -0.5395 -0.0023 0.5333 3.6623
##
## Random effects:
## Groups Name Variance Std.Dev.
## item_id (Intercept) 0.06178 0.2485
## Residual 0.28248 0.5315
## Number of obs: 858, groups: item_id, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.704e-01 1.033e-01 2.414e+01 6.487 1.01e-06 ***
## condition -1.632e+00 9.387e-02 8.411e+02 -17.389 < 2e-16 ***
## foreignLanguage 6.542e-02 5.918e-02 8.414e+02 1.105 0.2693
## dialectwest 1.055e-01 5.192e-02 8.411e+02 2.032 0.0424 *
## order 2.695e-04 2.925e-03 8.410e+02 0.092 0.9266
## condition:dialectwest -5.925e-02 7.345e-02 8.411e+02 -0.807 0.4201
## condition:foreignLanguage 4.256e-02 8.388e-02 8.418e+02 0.507 0.6120
## condition:order 7.082e-03 4.110e-03 8.410e+02 1.723 0.0853 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) condtn frgnLn dlctws order cndtn:d cndt:L
## condition -0.464
## foreignLngg -0.175 0.193
## dialectwest -0.252 0.278 0.098
## order -0.531 0.585 0.011 0.027
## cndtn:dlctw 0.179 -0.371 -0.071 -0.707 -0.019
## cndtn:frgnL 0.124 -0.275 -0.709 -0.071 -0.008 0.099
## conditn:rdr 0.378 -0.813 -0.008 -0.019 -0.712 0.001 0.015
The analysis reveals a statistically significant effect of condition (t = −17.389, p<.001), indicating a meaningful difference in z-transformed ratings between the two experimental conditions. Specifically, the negative estimate (−1.632) suggests that one condition is associated with lower ratings compared to the other. The extremely low p-value (<0.001) provides strong evidence against the null hypothesis of no condition effect, supporting the conclusion that the experimental condition substantially influences the participants’ ratings.
Additionally, the model detected a marginally significant effect of dialect (t=2.032, p=0.0424). This suggests that participants’ ratings may differ based on the dialect, though the effect is relatively small. The positive estimate (0.1055) indicates that Western Japanese dialect is associated with slightly higher ratings compared to Eastern Japanese dialect
The model also identified a marginally significant interaction between condition and order (t=1.723, p=0.0853). This suggests that the effect of condition may vary depending on the presentation order, though this interaction does not reach the conventional threshold for statistical significance. The order effect alone was not significant (t=0.092, p=0.9266), indicating that the presentation order does not independently influence ratings.
Other predictors, such as foreignLanguage (t=1.105, p=0.2693) and the interaction between condition and foreignLanguage (t=0.507, p=0.6120), were not statistically significant. Similarly, the interaction between condition and dialect (t=−0.807, p=0.4201) did not reach significance, suggesting that these factors do not meaningfully influence the ratings in this model.
To formally assess significance of each fixed effect, we can conduct a likelihood ratio test comparing a model including a particular fixed effect with a model excluding it. This will help determine whether the interaction term significantly improves the model’s ability to explain the data more robustly than relying solely on p-values. In doing so, we start with a fixed effect with the highest p-values, namely order. Since we remove order, we also remove **condition*order** in model_f_1 below, which is compared with model_f.
model_f_1 <- lmer(rating_z ~ condition + foreignLanguage + dialect +
condition*dialect + condition*foreignLanguage +
(1 | item_id),
data = experimentResult_AccCase)
anova(model_f, model_f_1)
## refitting model(s) with ML (instead of REML)
## Data: experimentResult_AccCase
## Models:
## model_f_1: rating_z ~ condition + foreignLanguage + dialect + condition * dialect + condition * foreignLanguage + (1 | item_id)
## model_f: rating_z ~ condition + foreignLanguage + dialect + order + condition * dialect + condition * foreignLanguage + condition * order + (1 | item_id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_f_1 8 1394.5 1432.6 -689.27 1378.5
## model_f 10 1392.0 1439.6 -686.01 1372.0 6.5144 2 0.0385 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result of the likelihood ratio test between model_f_1 (reduced model) and model_f (full model) shows a significant difference (\(\chi^2\) = 6.5144, p=0.0385). This means that removing order and its interaction condition x order significantly worsens the model fit. Therefore, we should retain order and condition x order in the model (but see below).
Next, we examine the significance of **condition*foreignLanguage**.
model_f_2 <- lmer(rating_z ~ condition + foreignLanguage + dialect + order +
condition*dialect + condition*order +
(1 | item_id),
data = experimentResult_AccCase)
anova(model_f, model_f_2)
## refitting model(s) with ML (instead of REML)
## Data: experimentResult_AccCase
## Models:
## model_f_2: rating_z ~ condition + foreignLanguage + dialect + order + condition * dialect + condition * order + (1 | item_id)
## model_f: rating_z ~ condition + foreignLanguage + dialect + order + condition * dialect + condition * foreignLanguage + condition * order + (1 | item_id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_f_2 9 1390.3 1433.1 -686.14 1372.3
## model_f 10 1392.0 1439.6 -686.01 1372.0 0.2575 1 0.6119
The likelihood ratio test comparing model_f_2 and model_f yielded a non-significant result (\(\chi^2\) = 0.2575, p = 0.6119), indicating that there is no significant difference in fit between the two models. Since model_f_2 has fewer parameters (9 vs. 10) and fits the data just as well as model_f, it is the preferred model according to the principle of parsimony. This principle favors simpler models when they perform equally well in explaining the data, as they are more interpretable and less prone to overfitting.
The following is the summary of model-f_2. Notably, once we remove **condition*foreignLanguage, the p-value of foreignLanguage** in model_f_2 is below 0.05.
summary(model_f_2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## rating_z ~ condition + foreignLanguage + dialect + order + condition *
## dialect + condition * order + (1 | item_id)
## Data: experimentResult_AccCase
##
## REML criterion at convergence: 1412.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.8726 -0.5422 -0.0070 0.5340 3.6578
##
## Random effects:
## Groups Name Variance Std.Dev.
## item_id (Intercept) 0.06164 0.2483
## Residual 0.28224 0.5313
## Number of obs: 858, groups: item_id, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.639e-01 1.025e-01 2.344e+01 6.479 1.19e-06 ***
## condition -1.619e+00 9.021e-02 8.420e+02 -17.949 < 2e-16 ***
## foreignLanguage 8.669e-02 4.174e-02 8.420e+02 2.077 0.0381 *
## dialectwest 1.074e-01 5.176e-02 8.421e+02 2.074 0.0384 *
## order 2.817e-04 2.924e-03 8.420e+02 0.096 0.9233
## condition:dialectwest -6.296e-02 7.305e-02 8.421e+02 -0.862 0.3890
## condition:order 7.051e-03 4.108e-03 8.420e+02 1.716 0.0865 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) condtn frgnLn dlctws order cndtn:d
## condition -0.451
## foreignLngg -0.124 -0.003
## dialectwest -0.246 0.270 0.069
## order -0.534 0.606 0.008 0.027
## cndtn:dlctw 0.168 -0.360 0.000 -0.706 -0.019
## conditn:rdr 0.379 -0.842 0.004 -0.018 -0.712 -0.001
Next, we examine the significance of **condition*dialect**.
model_f_3 <- lmer(rating_z ~ condition + foreignLanguage + dialect + order +
condition*order +
(1 | item_id),
data = experimentResult_AccCase)
anova(model_f_2, model_f_3)
## refitting model(s) with ML (instead of REML)
## Data: experimentResult_AccCase
## Models:
## model_f_3: rating_z ~ condition + foreignLanguage + dialect + order + condition * order + (1 | item_id)
## model_f_2: rating_z ~ condition + foreignLanguage + dialect + order + condition * dialect + condition * order + (1 | item_id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_f_3 8 1389.0 1427.1 -686.52 1373.0
## model_f_2 9 1390.3 1433.1 -686.14 1372.3 0.749 1 0.3868
The likelihood ratio test comparing model_f_3 and model_f_2 yielded a non-significant result (\(\chi^2\) = 0.749, p = 0.3868), indicating that there is no significant difference in fit between the two models. Since model_f_3 has fewer parameters (8 vs. 9) and fits the data just as well as model_f_2, it is the preferred model according to the principle of parsimony. The following is the summary of model_f_3.
summary(model_f_3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## rating_z ~ condition + foreignLanguage + dialect + order + condition *
## order + (1 | item_id)
## Data: experimentResult_AccCase
##
## REML criterion at convergence: 1410.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.8987 -0.5293 -0.0138 0.5479 3.6258
##
## Random effects:
## Groups Name Variance Std.Dev.
## item_id (Intercept) 0.06173 0.2485
## Residual 0.28215 0.5312
## Number of obs: 858, groups: item_id, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.788e-01 1.010e-01 2.211e+01 6.718 9.17e-07 ***
## condition -1.647e+00 8.416e-02 8.430e+02 -19.572 < 2e-16 ***
## foreignLanguage 8.668e-02 4.173e-02 8.430e+02 2.077 0.0381 *
## dialectwest 7.589e-02 3.667e-02 8.430e+02 2.069 0.0388 *
## order 2.348e-04 2.923e-03 8.430e+02 0.080 0.9360
## condition:order 7.049e-03 4.108e-03 8.430e+02 1.716 0.0865 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) condtn frgnLn dlctws order
## condition -0.424
## foreignLngg -0.126 -0.003
## dialectwest -0.182 0.024 0.096
## order -0.539 0.642 0.008 0.019
## conditn:rdr 0.385 -0.902 0.004 -0.027 -0.712
Next, we examine the significance of dialect.
model_f_4 <- lmer(rating_z ~ condition + foreignLanguage + order +
condition*order +
(1 | item_id),
data = experimentResult_AccCase)
anova(model_f_3, model_f_4)
## refitting model(s) with ML (instead of REML)
## Data: experimentResult_AccCase
## Models:
## model_f_4: rating_z ~ condition + foreignLanguage + order + condition * order + (1 | item_id)
## model_f_3: rating_z ~ condition + foreignLanguage + dialect + order + condition * order + (1 | item_id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_f_4 7 1391.3 1424.6 -688.67 1377.3
## model_f_3 8 1389.0 1427.1 -686.52 1373.0 4.2957 1 0.03821 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test comparing model_f_4 and model_f_3 yielded a significant result (\(\chi^2\) = 4.2957, p = 0.0382), indicating that model_f_3 provides a significantly better fit to the data compared to model_f_4. This suggests that including the predictor dialect in the model meaningfully improves its ability to explain the variability in the data. Therefore, model_f_3 with dialect is preferred over model_f_4.
Next, we examine the significance of foreignLanguage.
model_f_5 <- lmer(rating_z ~ condition + dialect + order +
condition*order +
(1 | item_id),
data = experimentResult_AccCase)
anova(model_f_3, model_f_5)
## refitting model(s) with ML (instead of REML)
## Data: experimentResult_AccCase
## Models:
## model_f_5: rating_z ~ condition + dialect + order + condition * order + (1 | item_id)
## model_f_3: rating_z ~ condition + foreignLanguage + dialect + order + condition * order + (1 | item_id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_f_5 7 1391.4 1424.7 -688.68 1377.4
## model_f_3 8 1389.0 1427.1 -686.52 1373.0 4.3286 1 0.03748 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test comparing model_f_5 and model_f_3 yielded a significant result (\(\chi^2\) = 4.3286, p = 0.0375), indicating that model_f_3 provides a significantly better fit to the data compared to model_f_5. This suggests that including the predictor foreignLanguage in the model meaningfully improves its ability to explain the variability in the data. Therefore, model_f_3 with foreignLanguage is preferred over model_f_5.
Lastly, we examine the significance of condition.
model_f_6 <- lmer(rating_z ~ foreignLanguage + dialect + order +
(1 | item_id),
data = experimentResult_AccCase)
anova(model_f_3, model_f_6)
## refitting model(s) with ML (instead of REML)
## Data: experimentResult_AccCase
## Models:
## model_f_6: rating_z ~ foreignLanguage + dialect + order + (1 | item_id)
## model_f_3: rating_z ~ condition + foreignLanguage + dialect + order + condition * order + (1 | item_id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_f_6 6 2338.5 2367.0 -1163.25 2326.5
## model_f_3 8 1389.0 1427.1 -686.52 1373.0 953.47 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test comparing model_f_6 and model_f_3 yielded a highly significant result (\(\chi^2\) = 953.47, p<0.001), indicating that model_f_3 provides a substantially better fit to the data compared to model_f_6. This suggests that including the predictor condition and its interaction with order in the model significantly improves its ability to explain the variability in the data. Therefore, model_f_3 with those predictors is strongly preferred over model_f_6.
Next, we will conduct confidence interval analysis to further assess the reliability of our predictors.
# Get 95% confidence intervals for the fixed effects
confint(model_f_3, method = "profile", level = 0.95)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 0.1547725925 0.39987874
## .sigma 0.5053716661 0.55585209
## (Intercept) 0.4791853675 0.87827964
## condition -1.8117536613 -1.48247234
## foreignLanguage 0.0050343975 0.16831221
## dialectwest 0.0041338174 0.14762833
## order -0.0054844385 0.00595270
## condition:order -0.0009889417 0.01508331
The 95% confidence intervals for the fixed effects in model_f_3 provide insights into the precision and significance of the estimated parameters. The intercept, representing the baseline rating when all predictors are zero, ranges from 0.479 to 0.878, indicating a significantly positive baseline rating. The effect of condition is strongly negative, with a confidence interval of -1.812 to -1.482, confirming its substantial and consistent impact on ratings. The effect of foreignLanguage is positive but marginally significant, with a confidence interval of 0.005 to 0.168, suggesting a small but potentially meaningful influence. Similarly, the effect of dialect(west) is positive and marginally significant, with a confidence interval of 0.004 to 0.148. The effect of order is negligible, with a confidence interval of -0.005 to 0.006, indicating no meaningful impact on ratings. Finally, the interaction between condition and order has a confidence interval of -0.001 to 0.015, suggesting a small and non-significant effect. Overall, these results highlight the robust and significant influence of condition on ratings, while other predictors (foreignLanguage, dialectwest, order, and their interactions) show weaker or negligible effects.
The likelihood ratio test comparing the model with order and condition x order to the model without these predictors (model_f vs.model_f_1) suggests that their inclusion improves model fit (\(\chi^2\)=6.5144, p=0.0385). However, the p-values and confidence intervals for these predictors indicate that their effects are not statistically significant as you can see in the confidence interval analysis and the summary of model_f_3 above (order: p=0.9360, CI = [−0.005,0.006]; condition x order: p=0.0865, CI = [−0.001,0.015]). Given that the primary focus of this analysis is on the effect of condition, and in the interest of model parsimony, we will remove order and condition x order from the final model. This decision aligns with the principle that simpler models are preferred when they do not significantly compromise explanatory power, especially when the predictors in question show negligible effects.
model_f_7 <- lmer(rating_z ~ condition + foreignLanguage + dialect +
(1 | item_id),
data = experimentResult_AccCase)
summary(model_f_7)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: rating_z ~ condition + foreignLanguage + dialect + (1 | item_id)
## Data: experimentResult_AccCase
##
## REML criterion at convergence: 1396.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.8836 -0.5413 -0.0218 0.5477 3.8099
##
## Random effects:
## Groups Name Variance Std.Dev.
## item_id (Intercept) 0.06148 0.2480
## Residual 0.28362 0.5326
## Number of obs: 858, groups: item_id, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.68274 0.08501 11.19511 8.032 5.63e-06 ***
## condition -1.51682 0.03637 845.04977 -41.702 < 2e-16 ***
## foreignLanguage 0.08530 0.04183 845.00139 2.039 0.0418 *
## dialectwest 0.07753 0.03676 845.00811 2.109 0.0352 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) condtn frgnLn
## condition -0.214
## foreignLngg -0.145 0.000
## dialectwest -0.205 0.000 0.096
Lastly, to validate the statistical assumptions underlying our model, we will examine the distribution and behavior of residuals through diagnostic plots. These visualizations help assess whether the residuals follow a normal distribution and exhibit homoscedasticity, which are key assumptions for the reliability of our statistical inferences.
# Extract residuals from the model
residuals_model_f_7 <- residuals(model_f_7)
hist(residuals_model_f_7)
Figure 4.29: Residual distribution from model_f_7, showing a relatively normal distribution with a slight central peak (leptokurtic tendency).
# QQ plot
qqnorm(residuals_model_f_7)
qqline(residuals_model_f_7, col = "red")
Figure 4.30: Q-Q plot of residuals from model_f_7, showing moderate deviations from normality, particularly at the tails.
plot(fitted(model_f_7), residuals(model_f_7))
abline(h = 0, col = "red")
Figure 4.31: Residuals vs. fitted values for model_f_7, showing relatively consistent distribution around zero.
shapiro.test(residuals_model_f_7)
##
## Shapiro-Wilk normality test
##
## data: residuals_model_f_7
## W = 0.97897, p-value = 8.791e-10
The histogram of residuals (Figure 4.29) reveals a somewhat leptokurtic distribution with residuals concentrated around zero, suggesting minor deviation from perfect normality. This observation is confirmed by the Q-Q plot (Figure 4.30), which shows reasonable alignment with the theoretical normal distribution along the middle range but notable deviations at both tails, particularly at the upper tail. The residuals versus fitted values plot (Figure 4.31) demonstrates relatively consistent scatter around zero across the fitted range, indicating acceptable homoscedasticity. However, there appears to be some clustering of points that may reflect the experimental conditions being tested. The Shapiro-Wilk test (W = 0.97897, p-value = 8.791e-10) formally rejects the null hypothesis of normality. While this indicates statistical departure from normality, it’s worth noting that with our experimental design testing naturalness ratings between distinct conditions, such departures are expected. As shown in Figure 4.32, the ratings exhibit a multimodal distribution with distinct peaks at approximately -1, 0, and 1, reflecting the predicted separation between experimental conditions, further supporting our experimental hypothesis and explaining the observed departure from normality in our residuals. Despite these deviations, the core statistical inferences remain robust, as the experimental design naturally accommodates these distributional characteristics.
hist(experimentResult_AccCase$rating_z, breaks = 7, main = "Histogram of Ratings", xlab = "Rating", col = "lightblue")
Figure 4.32: Histogram of z-transformed ratings showing a bimodal distribution. The distinct peaks around -1 and +1 represent the separation between unnatural (condition 1) and natural (condition 0) sentences, respectively. This pattern supports the experimental hypothesis and explains the observed departures from normality in the model residuals
This study investigated the acceptability of Japanese copular sentences with accusative case markers under different contextual conditions, using z-transformed ratings to control for individual differences in rating scale usage. A linear mixed-effects model was employed to analyze the data, incorporating fixed effects for condition, foreignLanguage, and dialect, as well as random intercepts for item_id. The final model revealed a significant effect of condition (p < 0.001), confirming that sentences in condition 0 were rated significantly higher than those in condition 1. Additionally, foreignLanguage (p = 0.042) and dialect (p = 0.035) were found to have small but significant effects on ratings, suggesting that participants’ language backgrounds and regional dialects subtly influenced their judgments.
The marginal significance of foreignLanguage and dialect may reflect nuanced differences in how bilingual participants and speakers of Western Japanese dialects perceive grammatical acceptability compared to monolingual participants and Eastern dialect speakers. However, these effects were small, indicating that the primary driver of acceptability judgments remains the contextual condition. Future research could explore these factors in greater depth, as well as the role of linguistic background (i.e., whether or not participants have studied linguistics) and the interaction between dialect and specific grammatical features.
The model’s residuals demonstrated acceptable homoscedasticity and a relatively normal distribution, with minor deviations at the tails, as confirmed by diagnostic plots. These diagnostics support the validity of the model, which in turn supports that the acceptability of accusative case markers in Japanese copular sentences is strongly influenced by context, with consistent patterns across items and participants. The inclusion of foreignLanguage and dialect as predictors improved model fit, highlighting the importance of accounting for participants’ language and regional backgrounds in acceptability judgment studies.
These errors are likely due to collection issues caused by the Heroku app (the platform used to run the experiment) rather than flaws in the experimental design. If the issue were design-related, more participants assigned the same item sets would have exhibited similar patterns.↩︎
The following is a translation of the methodology of their experiment from the original Japanese text: Two hundred university students from the information science department participated as subjects in the experiment. The time it took for each subject to finish reading a stimulus text was measured. Participants were instructed to read at their usual pace but were told not to skim through the text. Additionally, after reading the stimulus text, they were given simple questions about its content. This was done to discourage skimming and to verify their comprehension. In this experiment, all participants answered all the questions correctly after reading.↩︎
Saida (2004) also considers people who can read 1000 to 1200 characters per minute as fast readers. However, it should be noted that in his experiment, where participants were asked to read Japanese texts at their usual pace, the reading speed was almost normally distributed, ranging from approximately 300 to 1600 characters per minute, with a mean of 504.9 characters per minute and a standard deviation of 99.2 characters per minute. While reading 1600 characters per minute is significantly faster than reading 1200 characters per minute, it is unclear whether the relevant participant(s) read the texts without skimming or whether they understood the content well.↩︎
The first anchor item is excluded because its duration cannot be measured. The time spent on each item x is calculated by subtracting the end time of item x - 1 from the end time of item x. This is why the duration spent on the first anchor item cannot be measured.↩︎